Python 3.7 urllib.request returns &nbsp; instead of content - python

So I wrote some code that reads and prints everything between specified tags in HTML code; for example, everything between <p> and </p> gets printed.
This was from a sentdex lesson.
There is no problem with the code itself, but rather with what comes out.
I filtered with very specific criteria:
paragraphs = re.findall(r'<div style="font-size: 23px; margin-top: 20px;" class="jsdfx-sentiment-present">(.*?)</div>',str(respData))
So, as already mentioned, it works. The content is printed later, and what it prints is
&nbsp;
. As I understand it, this is a non-breaking space in HTML. Instead of a space I expected to see numbers. On this website, the numbers in this location update every few seconds.
How can I get these numbers instead of receiving &nbsp;?
Regards!

It depends on how exactly you're downloading the page, and from where, but because you say the value changes constantly when you look at it in a web browser, I'd wager that when you download the page, that &nbsp; is exactly what's inside that div - the page changes it on the fly via javascript or something while you're actually viewing it. Your tutorial uses a static tag, one that's the same every time you load the page, rather than one that gets set dynamically after the page is already active.
It's fairly common in web development to do this for dynamic values - put a placeholder value in a div, and then edit the content dynamically as appropriate. Of course, if you just take a snapshot of the page (and even more so if you take that snapshot before the javascript that would have filled in that value has had a chance to run), you're not going to see the change, and you get only the default value, without the number filled in.
Based on the tutorial you linked, you're probably using urllib. If you want to get dynamic content from an HTML page, that's probably not the best tool to use - you should look into selenium and BeautifulSoup. This StackOverflow answer goes into a lot more detail on effective solutions to this problem.
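As a minimal sketch of that approach - assuming Chrome and chromedriver are installed, and reusing the class name from your regex (the URL is a placeholder):

import time
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get("https://example.com/the-page-you-scrape")  # placeholder URL
time.sleep(2)  # crude pause so the page's javascript can fill in the values

# driver.page_source holds the rendered DOM, not the raw download, so the
# div should now contain the numbers rather than the &nbsp; placeholder.
soup = BeautifulSoup(driver.page_source, "html.parser")
for div in soup.find_all("div", class_="jsdfx-sentiment-present"):
    print(div.get_text(strip=True))
driver.quit()

A WebDriverWait with an expected condition would be more robust than the fixed sleep, but this shows the idea.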

Related

Impossible to recover some information with Beautifulsoup on a site

I need your help because, for the first time, I'm having problems getting some information with BeautifulSoup.
I have two problems on this page:
The green GET COUPON CODE button appears after a few moments (see the GIF capture).
When I inspect the button link, I find a simple href attribute that calls an out.php script, which opens the destination link that I am trying to capture.
Thank you for your help
Your problem is a little unclear, but if I understand correctly, your first problem is that the GET COUPON CODE button has no usable link yet when you render the HTML that you receive from the original page request.
The markup for a lot of this page is rendered dynamically using javascript, so that button is missing its href value until it gets loaded in later. You would need to also run the javascript on that page to render it after the initial request. You can't really do that easily with just the python requests library and BeautifulSoup. It will be a lot easier if you use Selenium, which lets you control a browser so it runs all that javascript for you, and then you can just get the button info a couple of seconds after loading the page.
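A sketch of the Selenium route - the URL is a placeholder, and locating the button by its link text is an assumption you'd need to check against the real page:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://example.com/coupon-page")  # placeholder URL

# Wait until the javascript has filled in the button's href.
button = WebDriverWait(driver, 10).until(
    EC.element_to_be_clickable((By.LINK_TEXT, "GET COUPON CODE"))
)
print(button.get_attribute("href"))
driver.quit()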
There is a way to do this all with plain requests, but it's a bit tedious. You would need to read through the requests the page makes and figure out which one fetches the link for the button. The upside is that this cuts down both the number of steps and the time it takes to get the info you need. You could just make that one request every time to get the right PHP link and then get the info from there.
For your second point, in case I haven't answered it already: maybe you're also trying to get the redirect link from that PHP link. From inspecting the network requests, it looks like the info will be found in the response headers; there is no body to inspect.
(I know it says 'from cache' but the point is that the redirect is being caused by the header info)
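A sketch of reading that redirect header with requests - the out.php URL here is a placeholder for the real one from the page:

import requests

# allow_redirects=False stops requests from following the redirect,
# so we can read the destination out of the headers ourselves.
resp = requests.get("https://example.com/out.php?id=123", allow_redirects=False)
print(resp.status_code)              # typically 301 or 302 for a redirect
print(resp.headers.get("Location"))  # the destination link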

How do I check if a website is responsive using python?

I am using python3 in combination with BeautifulSoup.
I want to check if a website is responsive or not. First I thought of checking the meta tags of a website to see if there is something like this in them:
content="width=device-width, initial-scale=1.0"
Accuracy is not that good using this method, but I have not found anything better.
Does anybody have an idea?
Basically I want to do the same as Google does here: https://search.google.com/test/mobile-friendly reduced to the output of whether the website is responsive or not (Y/N)
(Just a suggestion)
I am not an expert on this, but my first thought is that you need to render the website and see if it "responds" to different screen sizes. I would normally use something like phantomjs to do this.
Apparently, you can do this in python with selenium (more info at https://stackoverflow.com/a/15699761/3727050). A more comprehensive list of technologies that can be used for this task can be found here. Note that these resources seem a bit old/outdated and some solutions fall back to python subprocess calling phantomjs.
The linked Google test seems to load the page in a small browser and check:
The font size is readable
The distance between clickable elements makes the page usable
I would however do the following (sketched below):
Load the page in desktop mode and record each div's style.
Gradually reduce the size of the screen and see what percentage of them change style.
In most cases, going from a large screen down to phone size you should see 1-3 distinct layouts, which should be identifiable from the percentage of elements changing style.
The above does not guarantee that the page is "mobile-friendly" (i.e. usable on a mobile device), but it shows whether the CSS is responsive.
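A sketch of that idea with selenium, assuming Chrome/chromedriver; which CSS properties to sample is a judgment call, and the URL is a placeholder:

from selenium import webdriver
from selenium.webdriver.common.by import By

def style_snapshot(driver):
    # Sample a few layout-relevant computed properties for every div.
    return [
        tuple(div.value_of_css_property(p) for p in ("display", "width", "float"))
        for div in driver.find_elements(By.TAG_NAME, "div")
    ]

driver = webdriver.Chrome()
driver.get("https://example.com")  # placeholder URL

driver.set_window_size(1920, 1080)  # desktop mode
desktop = style_snapshot(driver)

driver.set_window_size(375, 667)    # roughly phone-sized
mobile = style_snapshot(driver)

# Resizing doesn't reload the page, so the div order should match up.
changed = sum(1 for d, m in zip(desktop, mobile) if d != m)
print(f"{changed} of {len(desktop)} divs changed style")
driver.quit()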

How to get visible text from a webpage using Selenium & python?

I am trying to grab a bunch of numbers that are presented in a table on a web page that I've accessed using python and Selenium running headless on a Raspberry Pi. The numbers are not in the page source; rather, they are deeply embedded in complex html served by several URLs called by the main page (the numbers update every few seconds). I know I could parse the html to get the numbers I want, but the numbers are already sitting on the front page in perfect format, all in one place. I can select and copy the numbers when I view the web page in Chrome on my PC.
How can I use python and Selenium webdriver to get those numbers? Can Selenium simply provide all the visible text on a page? How? (I've tried driver.page_source, but the text returned does not contain the numbers.) Or is there a way to essentially copy text and numbers from a table visible on the screen using python and Selenium? (I've looked into xdotool but didn't find enough documentation to help.) I'm just learning Selenium, so any suggestions will be much appreciated!
Well, I figured out the answer to my question. It's embarrassingly easy. This line gets just what I need - all the text that is visible on the web page:
page_text = driver.find_element_by_tag_name('body').text
There are a couple of different situations in which you cannot get some info from the page:
The information hasn't loaded yet. You must wait some time for your information to be ready. You can read this thread for a better understanding. Sometimes page elements are added dynamically with JS and so on, and load very slowly.
The information may consist of a different type of data. For example, you are expecting text with numbers, but you get a picture of the numbers on the page. In this situation you must change your tactics and use other functions to get what you need.
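For the first case, an explicit wait is more reliable than hoping the content is already there; a sketch, with a placeholder URL and locator:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://example.com")  # placeholder URL

# Block for up to 15 seconds until the table cells exist in the DOM.
WebDriverWait(driver, 15).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "table td"))
)
page_text = driver.find_element(By.TAG_NAME, "body").text
print(page_text)
driver.quit()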

The HTML code I scrape seems to be incomplete in comparison to the full website. Could the HTML be changing dynamically?

I am currently scraping a website for work to be able to sort the data locally. However, when I do this, the code seems to be incomplete, and I suspect it may be changing as I scroll on the website to load more content. Can this happen? And if so, how can I ensure I am able to scrape the whole website for processing?
I currently only know some python and html for web scraping, and am looking into what other elements may be causing this issue (javascript or ReactJS etc).
I am expecting to get a list of 50 names when scraping the website, but it only returns 13. I have downloaded the whole HTML file to go through it, and none of the other names seem to exist in the file, which is why I think the file may be changing dynamically.
Yes, the content of the HTML can be dynamic, and javascript loading is the most likely cause. For Python, scrapy + splash may be a good choice to get started.
Depending on how the data is handled, there are different methods for handling dynamic HTML content.
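A sketch of the scrapy + splash route - it assumes a Splash instance is running (e.g. via docker) and scrapy-splash is configured in the project settings; the URL and selector are placeholders:

import scrapy
from scrapy_splash import SplashRequest

class NamesSpider(scrapy.Spider):
    name = "names"

    def start_requests(self):
        # args={'wait': 2} gives the page's javascript time to render.
        yield SplashRequest("https://example.com/people", self.parse,
                            args={"wait": 2})

    def parse(self, response):
        # response.text is the rendered HTML, so all 50 names should be
        # present here, not just the 13 in the initial document.
        for name in response.css("li.name::text").getall():
            yield {"name": name}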

Scrape table from html (<tr> and ID method not working)

I'm currently trying to do a web scrape of a table from this website: http://pusatdata.kontan.co.id/reksadana/produk/469/Schroder-90-Plus-Equity-Fund
Specifically the grey table with the headers "TANGGAL/NAB/DIVIDEN/DAILY RETURN (%)".
Below is the code that I use:
import requests
import urllib.request
from bs4 import BeautifulSoup
quote_page = "http://pusatdata.kontan.co.id/reksadana/produk/469/Schroder-90-Plus-Equity-Fund"
page = urllib.request.urlopen(quote_page)
soup = BeautifulSoup(page, "html.parser")
table = soup.find('div',id='table_flex')
print (table.text)
But no output was generated at all. Appreciate your help. Thank you very much!
When you don't get the results you expect from your code, you need to backtrack to figure out where your code broke.
In this case, your first step would be to check the value of table. As it turns out, table is not None (which would signify a bad selector/soup.find call), so you at least know that you got that much right.
Instead, what you'll notice is that the table_flex div is empty. This isn't terribly surprising to me, but let's pretend I don't know anything and this doesn't make any sense. So the next step would be to pull up a browser and to double check that the DOM (via your browser's inspect tool) has content in the table_flex div.
It does, so now you have to do some real digging. If you run a simple search on the DOM in the inspect window for "table_flex", you'll first see the div that we already know about, but then you'll see some Javascript/jQuery further down the page that references "#table_flex".
This Javascript is part of a $.ajax() call (which you would google and find out is basically a query to a webserver for information). You'll also note that $("#table_flex") has an html() method (which, after more googling, you find out sets the html content for a particular element).
And now we have your answer for why the div is empty: when the webserver is queried for that page, the server sends back a document that has a blank table. The querying party is then expected to execute the Javascript to fill in the rest of the page. Generally speaking, Python modules don't run Javascript (for several reasons), so the table never gets populated.
This tends to be standard operating procedure for dynamic content, as "template" webpages can be cached and quickly distributed (since no additional information is needed) and then the rest of the information is supplied as the user needs it. This can also allow the same document to be used for multiple urls and query arguments without having to generate new documents.
Ultimately, what will probably be easiest for you is to determine whether you can access that API directly yourself and simply query that url instead.
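A sketch of that direct-query approach - the endpoint URL below is hypothetical; find the real one in your browser's network tab (the XHR request made by the page's $.ajax() call) and mimic it:

import requests

resp = requests.get(
    "http://pusatdata.kontan.co.id/path/to/ajax/endpoint",  # hypothetical URL
    headers={"X-Requested-With": "XMLHttpRequest"},  # some servers check this
)
print(resp.text)  # the table fragment (or JSON) the page would have inserted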
There was no output generated because there is no text within the <div> with the id table_flex, so this shouldn't be a surprise.
The "table" in question can be found under a <div> with the id manajemen_reksadana. The two rows are not directly under that <div>, and the whole "table" is made of <div>s, so it's best to navigate to the known header/label texts and address the <div> containing the value relative to the <div> with the header/label text:
fund_management_node = soup.find('div', id='manajemen_reksadana')
for label_text in ['PRODUK', 'KATEGORI', 'NAB', 'DAILY RETURN']:
    label_node = fund_management_node.find(text=label_text).parent
    print(label_node.find_next_sibling('div').text)
