Web scraping using python, how to deal with ngif?

I'm trying to read the price of a fund which is not available through an API. The fund is listed here https://bors.e24.no/#!/instrument/KL-AFMI2.OSE
At first I thought this would be a simple task, so I looked at BeautifulSoup, but realized that what I wanted was not returned. As far as I can tell, this is due to the:
<!-- ngIf: $root.allowStreamingToggle -->
I'm a beginner so hoping someone can help me with an easy way to get this value.

I see JSON being returned from the following endpoint in the network tab:
import requests
headers = {'user-agent': 'Mozilla/5.0'}
r = requests.get('https://bors.e24.no/server/components/graphdata/(PRICE)/DAY/KL-AFMI2.OSE?points=500&stop=2019-07-30&period=1weeks', headers=headers).json()
The price is then:
r['rows'][0]['values']['series']['c1']['data'][3][1]
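The hard-coded index in the path used above picks one specific data point. A minimal sketch of a more defensive lookup follows, assuming (it is not confirmed by the site) that each entry under 'data' is a [timestamp, price] pair and that the last entry is the most recent one. The helper name latest_close and the sample payload are made up for illustration.

```python
import json


def latest_close(payload, series_key="c1"):
    """Return (timestamp, price) for the last point of a series, or None.

    Assumes the layout rows[0].values.series[series_key].data, with each
    entry shaped like [timestamp, price].
    """
    try:
        points = payload["rows"][0]["values"]["series"][series_key]["data"]
    except (KeyError, IndexError, TypeError):
        return None  # the payload does not have the expected layout
    if not points:
        return None  # the series exists but is empty
    last = points[-1]
    return last[0], last[1]


# Made-up sample with the same nesting as the real response.
sample = json.loads(
    '{"rows": [{"values": {"series": {"c1": {"data": [[1, 100.5], [2, 101.25]]}}}}]}'
)
print(latest_close(sample))
```

Returning None instead of raising makes it easier to notice when the endpoint changes its layout.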

The tag "ngIf" almost certainly means that the website you are attempting to scrape is an AngularJS app... in which case, the data is almost certainly NOT in the HTML page you are pulling and attempting to parse with BeautifulSoup.
Rather, the page is probably pulling the data later -- say, via AJAX -- and then rendering it into the page via Angular's client-side code.
If all that is right... then BeautifulSoup is not the right tool.
You might have some hope if you can identify the AJAX call that the page is making, then call THAT directly. Inspect it to see the data structure; if you are lucky it is JSON and then super easy to parse. If that looks promising, you can probably just use the requests library and skip BeautifulSoup. But you have to do the reverse engineering to figure out WHAT you should be calling.
Here, try this: I did a little snooping with the browser console. Is this the data you are looking for? get info for KL-AFMI2.OSE
If so, then just use that URL directly in requests.
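Before reverse engineering the network calls, it can help to confirm that the raw HTML really is an unrendered Angular template. The sketch below is a rough heuristic, not a reliable detector: the patterns are my own guesses at common AngularJS traces, and the function name looks_client_rendered is hypothetical.

```python
import re

# Traces that AngularJS templates commonly leave in unrendered HTML.
ANGULAR_HINTS = (
    re.compile(r"<!--\s*ngIf:", re.IGNORECASE),                      # ngIf placeholder comments
    re.compile(r"\bng-(app|if|repeat|controller|view)\b", re.IGNORECASE),  # directive attributes
    re.compile(r"\{\{.*?\}\}"),                                      # unfilled {{ bindings }}
)


def looks_client_rendered(html):
    """Return True if the HTML contains any of the Angular hints above."""
    return any(pattern.search(html) for pattern in ANGULAR_HINTS)


print(looks_client_rendered("<!-- ngIf: $root.allowStreamingToggle -->"))
print(looks_client_rendered("<p>Price: 100</p>"))
```

A True result only suggests that the data arrives later via JavaScript; the network tab is still the place to find the actual request.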

Related

Python Requests only pulling half of intended tags

I'm attempting to scrape a website, and pull each sheriff's name and county. I'm using DevTools in Chrome to identify the HTML tag needed to locate that information.
import requests
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
URL = 'https://oregonsheriffs.org/about-ossa/meet-your-sheriffs'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
sheriff_names = soup.find_all('a', class_='eg-sheriff-list-skin-element-1')
sheriff_counties = soup.find_all(class_='eg-sheriff-list-skin-element-2')
However, I'm finding that Requests is not pulling the entire page's HTML, even though the tag is at the end. If I scan page.content, I find that Sheriff Harrold is the last sheriff included, and that every sheriff from Curtis Landers onwards is not included (I tried pasting the full output of page.content but it's too long).
My best guess from reading this answer is that the website has JavaScript that loads the remaining part of the page upon interacting with it, which would imply that I need to use something like Selenium to interact with the page to get the rest of it to load first.
However, if you look at the website, it's very simple, so as a novice part of me is thinking that there has to be a way to scrape this basic website without using a more complex tool like Selenium. That said, I recognize that the website is WordPress-generated and WordPress can set up delayed JavaScript on even simple websites.
My questions are:
1) Do I really need to use Selenium to scrape a simple, WordPress-generated website like this? Or is there a way to get the full page to load with just Requests? Is there any way to tell when web pages will require a web driver and when Requests will not be enough?
2) I'm thinking one step ahead here - if I want to scale up this project, how would I be able to tell that Requests has not returned the full website, without manually inspecting the results of every website?
Thanks!

Unfortunately, your initial instinct is almost certainly correct. If you look at the page source it seems that they have some sort of lazy loading going on, pulling content from an external source.
A quick look at the page source indicates that they're probably using the "Essential Grid" WordPress plugin to do this. I think this supports preloading. If you look at the requests that are made you might be able to ascertain how it's loading this and pull directly from that source (perhaps a REST call, AJAX, etc.).
In a generalized sense, I'm afraid that there really isn't any automated way to programmatically determine if a page has 'fully' loaded, as that behavior is defined in code and can be triggered by anything.
If you want to capture information from pages that load content as you scroll, though, I believe Selenium is the tool you'll have to use.
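There is no general test for a "fully loaded" page, but for a site you already know, a spot check against a few values you expect to see can at least flag a truncated response. The sketch below is only that: a per-site sanity check, with a hypothetical helper name (missing_markers) and a made-up HTML snippet.

```python
def missing_markers(html, expected):
    """Return the expected strings that do not occur in the HTML.

    The comparison is case-insensitive. An empty result means every
    marker was found, not that the page is guaranteed to be complete.
    """
    haystack = html.lower()
    return [text for text in expected if text.lower() not in haystack]


# Made-up fragment standing in for page.text from the question.
html = "<a class='eg-sheriff-list-skin-element-1'>Sheriff Harrold</a>"
print(missing_markers(html, ["Harrold", "Landers"]))
```

For the sheriff page, one could list a name from the start, middle, and end of the roster; any name reported missing suggests the rest is loaded by JavaScript.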

python crawl with requests to get json

When I crawl, I usually make use of the page's scripts before parsing with Python, since this lets me get JSON, which can be easily structured and parsed.
>>> import requests
>>> r = requests.get('~.json')
>>> r.json()
However, encountering this page, https://www.eiganetflix.jp/%E3%82%BF%E3%82%A4%E3%83%97/tv-%E3%82%B7%E3%83%AA%E3%83%BC%E3%82%BA
It seems there's no request that fetches JSON to show the content on the page.
And it is hard to find the pagination JavaScript functions. (Actually, they are there, but I mean they seem hard to execute.)
In this case, how can I utilize existing requests and json method?
Or is there any easy way to crawl this?

If I understand correctly, you want to scrape a webpage which does not have a JSON response. Check to be sure that the website does not have an API that allows you to get JSON data. Any other structured data, such as XML, would also be helpful. If there is no such source, you would have to screen scrape, which is not the easiest method. Check out Scrapy, which is a framework for doing this, or you can use a library like BeautifulSoup for a custom solution.
If the page uses JavaScript, you would somehow need to run it on the page to get content and browse pages. You can use spynner or Selenium to do that.
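As a rough illustration of what screen scraping involves when the content is already in the HTML, the sketch below collects links with the standard library's html.parser. It is swapped in here for Scrapy or BeautifulSoup only to keep the example dependency-free; the class name LinkCollector and the sample markup are made up.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html):
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


sample = '<ul><li><a href="/title/1">First <b>show</b></a></li><li><a href="/title/2">Second</a></li></ul>'
print(extract_links(sample))
```

This only works on markup that is present in the response body; content injected by JavaScript still needs a browser-driving tool.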

Python requests vs. urllib2

I have used the requests library many times and I know it has a ton of advantages. However, I was trying to retrieve the following Wikipedia page:
https://en.wikipedia.org/wiki/Talk:Land_value_tax
and requests.get retrieves it partially:
response = requests.get('https://en.wikipedia.org/wiki/Talk:Land_value_tax', verify=False)
html = response.text
I tried it using urllib2 and urllib2.urlopen and it retrieves the same page completely:
html = urllib2.urlopen('https://en.wikipedia.org/wiki/Talk:Land_value_tax').read()
Does anyone know why this happens and how to solve it using requests?
By the way, looking at the number of times this post has been viewed, I realized that people are interested to know the differences between these two libraries. If anyone knows about other differences between these two libraries, I'll appreciate it if they edit this question or post an answer and add those differences.

Seems to me the problem lies in the scripting on the target page. The JS-driven content is rendered in here (in particular, I found calls to MediaWiki). So, you need to look at a web sniffer to identify it:
What to do? If you want to retrieve the whole page content, you had better plug in one of the libraries that evaluate in-page JavaScript. Read more here.
Update
I am not interested in retrieving the whole page and statistics or JS libraries retrieved from MediaWiki. I only need the whole content of the page (through scraping, not MediaWiki API).
The issue is that those JS calls to other resources (incl. MediaWiki) make it possible to render the WHOLE page on the client. But since the library does not support JS execution, the JS is not executed => page parts are not loaded from other resources => the target page is not whole.

How to use urllib to fill out forms and gather data?

I come from a world of scientific computing and number crunching.
I am trying to interact with the internet to compile data so I don't have to. One task is to auto-fill searches on Marriott.com so I can see what the best deals are all on my own.
I've attempted something simple like
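import urllib
import urllib2

url = "http://marriott.com"
# Form fields to submit; the keys must match the form's input names.
values = {'Location': 'New York'}
data = urllib.urlencode(values)
# Passing data makes urllib2 send a POST instead of a GET.
website = urllib2.Request(url, data)
response = urllib2.urlopen(website)
stuff = response.read()
# Save the response so it can be opened in a browser; "with" closes the file.
with open('test.html', 'w') as f:
    f.write(stuff)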
My questions are the following:
How do you know how the website receives information?
How do I know a simple "Post" will work?
If it is simple, how do I know what the names of the dictionary should be for "Values?"
How to check if it's working? The write lines at the end are an attempt for me to see if my inputs are working properly but that is insufficient.

You may also have a look at splinter, where urllib may not be useful (JS, AJAX, etc.)
For finding out the form parameters, Firebug could be useful.

You need to read and analyze the HTML code of the relevant site. Every browser has decent tools for inspecting the DOM of a site and analyzing the network traffic and requests.
Usually you want to use the mechanize module for performing automated interactions with a website. There is no guarantee that this will work in every case. Nowadays many websites use AJAX or more complex client-side programming, making it hard to "emulate" a human user using Python.
Apart from that: the marriott.com site does not contain an input field "Location"... so you are guessing URL parameters without having analyzed their forms and functionality?

What I do to check is use a web-debugging proxy to view the request you send.
First send a real request with your browser and compare that request to the request that your script sends. Try to make the two requests match.
What I use for this is Charles Proxy.
Another way is to open the HTML file you saved (in this case test.html) in your browser and compare it to the actual response.
To find out what the dictionary should contain, look at the page source and find the names of the form fields you're trying to fill. In your case, "Location" should actually be "destinationAddress.destination".
Here is a picture:
So look in the HTML code to get the names of the form fields; those are what should be in the dictionary. I know that Google Chrome and Mozilla Firefox both have tools to view the structure of the HTML (in the picture I used Inspect Element in Google Chrome).
For more info on urllib2 read here.
I really hope this helps :)
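Reading the field names out of the page source can also be scripted. The sketch below uses Python 3's standard library (not the Python 2 urllib2 from the question) and a made-up form snippet; only the field name destinationAddress.destination comes from the answer above, and the helper names are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class FormFieldCollector(HTMLParser):
    """Collect the name attribute of every input, select, and textarea."""

    def __init__(self):
        super().__init__()
        self.fields = []

    def handle_starttag(self, tag, attrs):
        if tag in ("input", "select", "textarea"):
            name = dict(attrs).get("name")
            if name:  # unnamed controls are not submitted, so skip them
                self.fields.append(name)


def form_field_names(html):
    parser = FormFieldCollector()
    parser.feed(html)
    parser.close()
    return parser.fields


# Made-up markup; a real page would be fetched first.
sample_form = (
    '<form action="/search" method="post">'
    '<input type="text" name="destinationAddress.destination">'
    '<input type="submit" value="Find">'
    "</form>"
)
names = form_field_names(sample_form)
body = urlencode({names[0]: "New York"})
print(names, body)
```

The encoded body is what would be passed as the POST data. Forms that are built or validated by JavaScript will not be fully described by their static HTML.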

Grabbing non-HTML data from a website using python

I'm trying to get the current contract prices on this page to a string: http://www.cmegroup.com/trading/equity-index/us-index/e-mini-sandp500.html
I would really like a python 2.6 solution.
It was easy to get the page html using urllib, but it seems like this number is live and not in the html. I inspected the element in Chrome and it's some td class thing.
But I don't know how to get at this with python. I tried beautifulsoup (but after several attempts gave up getting a tar.gz to work on my windows x64 system), and then elementtree, but really my programming interest is data analysis. I'm not a website designer and don't really want to become one, so it's all kind of a foreign language. Is this live price XML?
Any assistance gratefully received. Ideally a simple to install module and some actual code, but all hints and tips very welcome.

It looks like the numbers in the table are filled in by Javascript, so just fetching the HTML with urllib or another library won't be enough since they don't run the javascript. You'll need to use a library like PyQt to simulate the browser rendering the page/executing the JS to fill in the numbers, then scrape the output HTML of that.
See this blog post on working with PyQt: http://blog.motane.lu/2009/07/07/downloading-a-pages-content-with-python-and-webkit/

If you look at that website with something like Firebug, you can see the AJAX calls it's making. For instance, the initial values are being filled in with an AJAX call (at least for me) to:
http://www.cmegroup.com/CmeWS/md/MDServer/V1/Venue/G/Exchange/XCME/FOI/FUT/Product/ES?currentTime=1292780678142&contractCDs=,ESH1,ESM1,ESU1,ESZ1,ESH2,ESH1,ESM1,ESU1,ESZ1,ESH2
This is returning a JSON response, which is then parsed by JavaScript to fill in the table. It would be pretty simple to do that yourself with urllib and then use simplejson to parse the response.
Also, you should read this disclaimer very carefully. What you are trying to do is probably not cool with the owners of the web-site.
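The urllib-plus-JSON approach described above can be sketched as follows. This uses Python 3's standard library rather than the Python 2.6 and simplejson mentioned in the thread, and it assumes the endpoint still behaves as it did when the answer was written, which may no longer be true. The function names and the User-Agent header are my own choices, and the shape of the returned JSON is not shown because the source does not describe it.

```python
import json
import time
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Base of the AJAX URL quoted in the answer above.
BASE_URL = ("http://www.cmegroup.com/CmeWS/md/MDServer/V1/Venue/G/"
            "Exchange/XCME/FOI/FUT/Product/ES")


def build_quotes_url(contract_codes, now_ms=None):
    """Build the request URL; currentTime is a millisecond timestamp."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    query = urlencode(
        [("currentTime", now_ms), ("contractCDs", ",".join(contract_codes))],
        safe=",",  # keep the commas between contract codes unescaped
    )
    return BASE_URL + "?" + query


def fetch_json(url, timeout=10):
    """Fetch the URL and decode the body as JSON (performs a network call)."""
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


print(build_quotes_url(["ESH1", "ESM1"], now_ms=1292780678142))
```

Calling fetch_json(build_quotes_url([...])) would return the decoded payload, which then needs to be inspected to find where the prices live.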

It's hard to know what to tell you without knowing where the number is coming from. It could be PHP or ASP also, so you are going to have to figure out which language the number is in.
