Scrape table from html (<tr> and ID method not working) - python

I'm currently trying to do a web scrape of a table from this website: http://pusatdata.kontan.co.id/reksadana/produk/469/Schroder-90-Plus-Equity-Fund
Specifically the grey table with the headers "TANGGAL/NAB/DIVIDEN/DAILY RETURN (%)".
Below is the code that I use:
import requests
import urllib.request
from bs4 import BeautifulSoup
quote_page = "http://pusatdata.kontan.co.id/reksadana/produk/469/Schroder-90-Plus-Equity-Fund"
page = urllib.request.urlopen(quote_page)
soup = BeautifulSoup(page, "html.parser")
table = soup.find('div',id='table_flex')
print (table.text)
But no output was generated at all. Appreciate your help. Thank you very much!

When you don't get the results you expect from your code, you need to backtrack to figure out where your code broke.
In this case, your first step would be to check the value of table. As it turns out, table is not None (which would signify a bad selector/soup.find call), so you at least know that you got that much right.
Instead, what you'll notice is that the table_flex div is empty. This isn't terribly surprising to me, but let's pretend I don't know anything and this doesn't make any sense. So the next step would be to pull up a browser and to double check that the DOM (via your browser's inspect tool) has content in the table_flex div.
It does, so now you have to do some real digging. If you run a simple search on the DOM in the inspect window for "table_flex", you'll first see the div that we already know about, but then you'll see some Javascript/jQuery further down the page that references "#table_flex".
This Javascript is part of a $.ajax() call (which you would google and find out is basically a query to a webserver for information). You'll also note that $("#table_flex") has an html() method (which, after more googling, you find out sets the html content for a particular element).
And now we have your answer for why the div is empty: when the webserver is queried for that page, the server sends back a document that has a blank table. The querying party is then expected to execute the Javascript to fill in the rest of the page. Generally speaking, Python modules don't run Javascript (for several reasons), so the table never gets populated.
This tends to be standard operating procedure for dynamic content, as "template" webpages can be cached and quickly distributed (since no additional information is needed) and then the rest of the information is supplied as the user needs it. This can also allow the same document to be used for multiple urls and query arguments without having to generate new documents.
Ultimately, what will probably be easiest for you is to determine whether you can access that API directly yourself and simply query that url instead.
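To make that diagnosis concrete, here is a minimal sketch that uses only the standard library. The HTML string and the /ajax/nab_history endpoint are invented stand-ins, not taken from the real site. The point is only to show that the div is empty in the served document, and how you might spot the URL the script would have queried.

```python
import re
from html.parser import HTMLParser

# Made-up stand-in for what the server sends before any JavaScript runs.
SAMPLE_HTML = """
<html><body>
<div id="table_flex"></div>
<script>
$.ajax({url: "/ajax/nab_history", success: function (d) { $("#table_flex").html(d); }});
</script>
</body></html>
"""


class DivText(HTMLParser):
    """Collect the text found inside the element with a given id.

    Simplification: void tags such as <br> inside the target are not handled.
    """

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0  # > 0 while we are inside the target element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.depth += 1
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def div_text(html, target_id):
    """Return the stripped text inside the element with id == target_id."""
    parser = DivText(target_id)
    parser.feed(html)
    return "".join(parser.chunks).strip()


def ajax_urls(html):
    """Rough regex heuristic: find url: "..." arguments in inline scripts."""
    return re.findall(r'url\s*:\s*["\']([^"\']+)["\']', html)


table_text = div_text(SAMPLE_HTML, "table_flex")
endpoints = ajax_urls(SAMPLE_HTML)
print(repr(table_text))
print(endpoints)
```

If the div comes back empty but a candidate endpoint shows up, that endpoint is the thing to try requesting directly.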

There was no output generated because there is no text within the <div> with the id table_flex. So this shouldn't be a surprise.
The "table" in question can be found under a <div> with the id manajemen_reksadana. The two rows are not directly under that <div> and the whole "table" is made of <div>s, so it's best to navigate to the known header/label texts, and address the <div> containing the value relative to the <div> with the header/label text:
fund_management_node = soup.find('div', id='manajemen_reksadana')
for label_text in ['PRODUK', 'KATEGORI', 'NAB', 'DAILY RETURN']:
    label_node = fund_management_node.find(text=label_text).parent
    print(label_node.find_next_sibling('div').text)

Related

Python Requests only pulling half of intended tags

I'm attempting to scrape a website, and pull each sheriff's name and county. I'm using devtools in chrome to identify the HTML tag needed to locate that information.
import requests
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
URL = 'https://oregonsheriffs.org/about-ossa/meet-your-sheriffs'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
sheriff_names = soup.find_all('a', class_ = 'eg-sheriff-list-skin-element-1')
sheriff_counties = soup.find_all(class_ = 'eg-sheriff-list-skin-element-2')
However, I'm finding that Requests is not pulling the entire page's HTML, even though the tag is at the end. If I scan page.content, I find that Sheriff Harrold is the last sheriff included, and that every sheriff from Curtis Landers onwards is missing (I tried pasting the full output of page.content but it's too long).
My best guess from reading this answer is that the website has javascripts that load the remaining part of the page upon interacting with it, which would imply that I need to use something like Selenium to interact with the page to get the rest of it to first load.
However, if you look at the website, it's very simple, so as a novice part of me is thinking that there has to be a way to scrape this basic website without using a more complex tool like Selenium. That said, I recognize that the website is wordpress generated and wordpress can set up delayed javascripts on even simple web sites.
My questions are:
1) Do I really need to use Selenium to scrape a simple, WordPress-generated website like this? Or is there a way to get the full page to load with just Requests? Is there any way to tell when a web page will require a web driver and when Requests will be enough?
2) I'm thinking one step ahead here - if I want to scale up this project, how would I be able to tell that Requests has not returned the full website, without manually inspecting the results of every website?
Thanks!
Unfortunately, your initial instinct is almost certainly correct. If you look at the page source it seems that they have some sort of lazy loading going on, pulling content from an external source.
A quick look at the page source indicates that they're probably using the "Essential Grid" WordPress theme to do this. I think this supports preloading. If you look at the requests that are made you might be able to ascertain how it's loading this and pull directly from that source (perhaps a REST call, AJAX, etc).
In a generalized sense, I'm afraid that there really isn't any automated way to programmatically determine if a page has 'fully' loaded, as that behavior is defined in code and can be triggered by anything.
If you want to capture information from pages that load content as you scroll, though, I believe Selenium is the tool you'll have to use.

Script cannot fetch data from a web page

I am trying to write a program in Python that can take the name of a stock and its price and print it. However, when I run it, nothing is printed. It seems the data is not being fetched from the website. I double-checked that the path from the web page is correct, but for some reason the text does not show up.
from lxml import html
import requests
page = requests.get('https://www.bloomberg.com/quote/UKX:IND?in_source=topQuotes')
tree = html.fromstring(page.content)
Prices = tree.xpath('//span[@class="priceText__1853e8a5"]/text()')
print ('Prices:' , Prices)
here is the website I am trying to get the data from
I have tried BeautifulSoup, but it has the same problem.
If you print the string page.content, you'll see that the website code it captures is actually for a captcha test, not the "real" destination page itself you see when you manually visit the website. It seems that the website was smart enough to see that your request to this URL was from a script and not manually from a human, and it effectively prevented your script from scraping any real content. So Prices is empty because there simply isn't a span tag of class "priceText__1853e8a5" on this special Captcha page. I get the same when I try scraping with urllib2.
As others have suggested, Selenium (actual web automation) might be able to launch the page and get you what you need. The ID looks dynamically generated, though I do get the same one when I manually look at the page. Another alternative is to simply find a different site that can give you the quote you need without blocking your script. I tried it with https://tradingeconomics.com/ukx:ind and that works. Though of course you'll need a different xpath to find the cell you need.
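A quick way to catch this in a script is to check what came back before you run your XPath. The sketch below is only a rough heuristic: the function name, the list of block words, and both sample HTML strings (including the price) are made up for illustration.

```python
def looks_like_block_page(html, expected_marker,
                          block_words=("captcha", "are you a robot", "unusual activity")):
    """Guess whether a response is a bot-check page instead of the real one.

    Returns False as soon as the marker we expect on the real page is present;
    otherwise returns True if any typical block-page wording shows up.
    A page with neither is reported as False, so this is not a guarantee.
    """
    text = html.lower()
    if expected_marker.lower() in text:
        return False
    return any(word in text for word in block_words)


# Invented examples of the two kinds of response.
blocked_html = ("<html><title>Are you a robot?</title>"
                "<body>Please complete the CAPTCHA.</body></html>")
real_html = '<span class="priceText__1853e8a5">1,234.56</span>'

print(looks_like_block_page(blocked_html, "priceText__1853e8a5"))
print(looks_like_block_page(real_html, "priceText__1853e8a5"))
```

In a real script you would pass page.text from the response, and stop or switch strategy when the check fires instead of silently printing an empty list.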

Scrapy SgmlLinkExtractor how to define XPath

I want to retrieve the cityname and citycode and store it in one string variable. The image shows the precise location:
Google Chrome gave me the following XPath:
//*[@id="page"]/main/div[4]/div[2]/div[1]/div/div/div[1]/div[2]/div/div[1]/div/a[1]/span
So I defined the following statement in scrapy to get the desired information:
plz = response.xpath('//*[@id="page"]/main/div[4]/div[2]/div[1]/div/div/div[1]/div[2]/div/div[1]/div/a[1]/span/text()').extract()
However I was not successful, the string remains empty. What XPath definition should I use instead?
Most of the time, this happens because browsers correct invalid HTML, so the DOM you inspect differs from the source Scrapy receives. How do you fix this? Inspect the raw HTML source and write your own XPath that navigates the document with the shortest, simplest query.
I scrape a lot of data off of the web and I've never used an XPath as specific as the one you got from the browser. This is for a few reasons:
It will fail quickly on invalid HTML or the most basic of hierarchy changes.
It contains no identifying data for debugging an issue when the website changes.
It's way longer than it should be.
Here's an example query for grabbing that element. There are many different XPath queries you could write to find this data; I'd suggest learning the syntax and rewriting this query so that your XPath queries follow common patterns throughout your project:
//div[contains(@class, "detail-address")]//h2/following-sibling::span
The other main source of this problem is sites that extensively rely on JS to modify what is shown on the screen. Conveniently, though, this would be debugged the same way as above. As soon as you glance at the HTML returned on page load, you would notice that the data you are querying doesn't exist until JS executes. At that point, you would need to do some sort of headless browsing.
Since my answer was essentially "write your own XPath" (rather than relying on the browser), I'll leave some sources:
basic XPath introduction
list of XPath functions
XPath Chrome extension
The DOM is manipulated by JavaScript, so what Chrome shows is the XPath after all of that manipulation has happened.
If all you want is to get the cities, you can get it this way (using scrapy):
city_text = response.css('.detail-address span::text').extract_first()
city_code, city_name = city_text.split(maxsplit=1)
Or you can manipulate the JSON in CDATA to get all the data you need:
cdata_text = response.xpath('//*[@id="tdakv"]/text()').extract_first()
json_str = cdata_text.splitlines()[2]
json_str = json_str[json_str.find('{'):]
data = json.loads(json_str) # import json
city_code = data['kvzip']
city_name = data['kvplace']
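If the exact line layout inside the CDATA block is uncertain, a slightly more forgiving variant is to decode the first JSON object found in the text. This is only a sketch: the sample CDATA string below is invented (the real block's layout and values may differ), and first_json_object is a hypothetical helper name. It relies on json.JSONDecoder.raw_decode, which stops at the end of the object and so tolerates a trailing semicolon.

```python
import json

# Invented sample of what the script text might look like.
SAMPLE_CDATA = """
//<![CDATA[
var tdakv = {"kvzip": "12345", "kvplace": "Sample City"};
//]]>
"""


def first_json_object(text):
    """Decode the first {...} object in text, ignoring whatever follows it."""
    start = text.find("{")
    if start == -1:
        raise ValueError("no JSON object found")
    obj, _end = json.JSONDecoder().raw_decode(text, start)
    return obj


data = first_json_object(SAMPLE_CDATA)
city_code = data["kvzip"]
city_name = data["kvplace"]
print(city_code, city_name)
```

In the Scrapy callback you would pass cdata_text in place of SAMPLE_CDATA.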

Selenium Python: clicking links produced by JSON application

[ Ed: Maybe I'm just asking this? Not sure -- Capture JSON response through Selenium ]
I'm trying to use Selenium (Python) to navigate via hyperlinks to pages in a web database. One page returns a table with hyperlinks that I want Selenium to follow. But the links do not appear in the page's source. The only html that corresponds to the table of interest is a tag indicating that the site is pulling results from a facet search. Within the div is a <script type="application/json"> tag and a handful of search options. Nothing else.
Again, I can view the hyperlinks in Firefox, but not using "View Page Source" or Selenium's selenium.webdriver.Firefox().page_source call. Instead, that call outputs not the <script> tag but a series of <div> tags that appear to define the results' format.
Is Selenium unable to navigate output from JSON applications? Or is there another way to capture the output of such applications? Thanks, and apologies for the lack of code/reproducibility.
Try using execute_script() and get the links by running JavaScript, something like:
driver.execute_script("document.querySelector('div#your-link-to-follow').click();")
Note: if the div are generated by scripts dynamically, you may want to implicitly wait a few seconds before executing the script.
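Rather than sleeping for a fixed number of seconds, you can poll until the links exist. Selenium has WebDriverWait for this; the sketch below shows the same polling idea in plain Python so it runs without a browser. wait_until and FakePage are made-up names for illustration, and FakePage only simulates links that show up on the third check.

```python
import time


def wait_until(predicate, timeout=5.0, interval=0.1):
    """Call predicate() repeatedly until it returns something truthy.

    Returns that value, or raises TimeoutError once the timeout has passed.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        time.sleep(interval)


class FakePage:
    """Stand-in for a browser page: the links appear on the third poll."""

    def __init__(self):
        self.polls = 0

    def find_links(self):
        self.polls += 1
        return ["/record/1", "/record/2"] if self.polls >= 3 else []


page = FakePage()
links = wait_until(page.find_links, timeout=2.0, interval=0.01)
print(links)
```

With a real driver, the predicate would be a function that looks up the elements and returns the (possibly empty) list.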
I've confronted a similar situation on a website with JavaScript (http://ledextract.ces.census.gov to be specific). I had pretty good luck just using Selenium's find_element methods. The key is that even if not everything about the hyperlinks appears in the page's source, Selenium will usually be able to find it by navigating to the website, since doing that will engage the JavaScript that produces the additional links.
Thus, for example, you could try mousing over the links, finding their titles, and then using:
driver.find_element_by_xpath("//*[@title='Link Title']").click()
Based on whatever title appears by the link when you mouse over it.
Or, you may be able to find the links based on the text that appears on them:
driver.find_element_by_partial_link_text('Link Text').click()
Or, if you have a sense of the id for the links, you could use:
driver.find_element_by_id('Link_ID').click()
If you are at a loss for what the text, title, ID, etc. would be for the links you want, a somewhat blunt approach is to pull the id, text, and title for every element off the website and save that to a file that you can search through to identify likely candidates for the links you want. That should show you a lot more (in some respects) than just the source code for the site would:
AllElements = driver.find_elements_by_xpath('//*')
for Element in AllElements:
    print('ID = %s TEXT = %s Title = %s' % (Element.get_attribute("id"), Element.get_attribute("text"), Element.get_attribute("title")))
Note: if you have or suspect you have a situation where you'll have multiple links with the same title/text, etc. then you may want to use the find_elements (plural) methods to get lists of all those satisfying your criteria, specify the xpath more explicitly, etc.

Extracting data from webpage using lxml XPath in Python

I am having some unknown trouble when using xpath to retrieve text from an HTML page from lxml library.
The page url is www.mangapanda.com/one-piece/1/1
I want to extract the selected chapter name text from the drop down select tag. Now I just want the first option so the XPath to find that is pretty easy. That is :-
.//*[@id='chapterMenu']/option[1]/text()
I verified the above using Firepath and it gives the correct data, but when I try to use lxml for the purpose I get no data at all.
from lxml import html
import requests
r = requests.get("http://www.mangapanda.com/one-piece/1/1")
page = html.fromstring(r.text)
name = page.xpath(".//*[@id='chapterMenu']/option[1]/text()")
But in name nothing is stored. I even tried other XPath's like :-
//div/select[@id='chapterMenu']/option[1]/text()
//select[@id='chapterMenu']/option[1]/text()
The above were also verified using FirePath. I am unable to figure out what could be the problem. I would request some assistance regarding this problem.
But it is not the case that none of them work. An XPath that does work with lxml here is :-
.//img[@id='img']/@src
Thank you.
I've had a look at the html source of that page and the content of the element with the id chapterMenu is empty.
I think your problem is that it is filled using javascript and javascript will not be automatically evaluated just by reading the html with lxml.html
You might want to have a look at this:
Evaluate javascript on a local html file (without browser)
Maybe you're able to trick it though... In the end, also javascript needs to fetch the information using a get request. In this case it requests: http://www.mangapanda.com/actions/selector/?id=103&which=191919
Which is json and can be easily turned into a python dict/array using the json library.
But you have to find out how to get the id and the which parameter if you want to automate this.
The id is part of the HTML: look for document['mangaid'] within one of the script tags. The which parameter can maybe stay 191919 or has to be 0. I couldn't find it in any source, but I found that when it is 0 you will be redirected to the proper URL.
So there you go ;)
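As a rough sketch of that approach: the endpoint and the id/which values come from the URL above, but the JSON field names and the sample payload below are invented, since the real response format isn't shown here. selector_url and chapter_names are hypothetical helper names.

```python
import json
from urllib.parse import urlencode

SELECTOR_ENDPOINT = "http://www.mangapanda.com/actions/selector/"


def selector_url(manga_id, which=0):
    """Build the URL the page's JavaScript requests for the chapter list."""
    return SELECTOR_ENDPOINT + "?" + urlencode({"id": manga_id, "which": which})


# Invented payload; the real field names may differ.
SAMPLE_RESPONSE = '[{"chapter": "1", "chapter_name": "Sample chapter title"}]'


def chapter_names(payload):
    """Turn the JSON text into a plain list of chapter names."""
    return [entry["chapter_name"] for entry in json.loads(payload)]


url = selector_url(103, 191919)
names = chapter_names(SAMPLE_RESPONSE)
print(url)
print(names)
```

In practice you would fetch the URL with requests.get(url).text and pass that in place of SAMPLE_RESPONSE, after checking what keys the response really contains.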
The source document of the page you are requesting is in a default namespace:
<html xmlns="http://www.w3.org/1999/xhtml">
even if Firepath does not tell you about this. The proper way to deal with namespaces is to redeclare them in your code, which means associating them with a prefix and then prefixing element names in XPath expressions.
name = page.xpath("//*[@id='chapterMenu']/xhtml:option[1]/text()",
                  namespaces={'xhtml': 'http://www.w3.org/1999/xhtml'})
Then, the piece of the document the path expression above is concerned with is:
<select id="chapterMenu" name="chapterMenu"></select>
As you can see, there is no option element inside it. Please tell us what exactly you'd like to find.
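Both points can be checked on a small sample. The sketch below assumes the document is parsed as XML and uses the standard library's xml.etree.ElementTree instead of lxml, so it runs without extra packages; an HTML parser may treat the xmlns attribute differently. The sample document is a trimmed stand-in built from the two fragments quoted above.

```python
import xml.etree.ElementTree as ET

# Trimmed stand-in for the page: default XHTML namespace, empty select.
SAMPLE_XHTML = """<html xmlns="http://www.w3.org/1999/xhtml">
<body><select id="chapterMenu" name="chapterMenu"></select></body>
</html>"""

NS = {"xhtml": "http://www.w3.org/1999/xhtml"}
root = ET.fromstring(SAMPLE_XHTML)

# Without a prefix the element name does not match the namespaced tag.
unprefixed = root.findall(".//select[@id='chapterMenu']")

# With the prefix bound to the XHTML namespace the select is found...
prefixed = root.findall(".//xhtml:select[@id='chapterMenu']", NS)

# ...but it has no option children, so there is no text to extract.
options = root.findall(".//xhtml:select[@id='chapterMenu']/xhtml:option", NS)

print(len(unprefixed), len(prefixed), len(options))
```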
