Python crawl with requests to get JSON

When I do a crawl, I usually make use of a page's scripts before parsing with Python, since this allows me to get JSON, which can be easily structured and parsed.
>>> import requests
>>> r = requests.get('~.json')
>>> r.json()
However, on this page, https://www.eiganetflix.jp/%E3%82%BF%E3%82%A4%E3%83%97/tv-%E3%82%B7%E3%83%AA%E3%83%BC%E3%82%BA,
there seems to be no request that returns JSON for the material shown on the page.
It is also hard to find the pagination JavaScript functions. (Actually, there are some, but they seem hard to execute.)
In this case, how can I use the existing requests-and-json approach?
Or is there any easy way to crawl this?

If I understand correctly, you want to scrape a webpage which does not have a JSON response. First check whether the website has an API that lets you get JSON data; any other structured format, such as XML, would also help. If there is no such thing, you will have to screen-scrape, which is not the easiest approach. Check out Scrapy, a framework for doing this, or use a library like BeautifulSoup for a custom solution.
If the page relies on JavaScript, you will somehow need to run it to get the content and browse pages. You can use spynner or Selenium for that.
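For the JavaScript route, here is a minimal Selenium sketch. Assumptions: Chrome with a matching chromedriver is available, and the .item-title CSS selector is a placeholder you would replace after inspecting the page.
from selenium import webdriver
from selenium.webdriver.common.by import By

# Headless Chrome so this also runs on a server; requires chromedriver.
options = webdriver.ChromeOptions()
options.add_argument('--headless')
driver = webdriver.Chrome(options=options)

driver.get('https://www.eiganetflix.jp/%E3%82%BF%E3%82%A4%E3%83%97/tv-%E3%82%B7%E3%83%AA%E3%83%BC%E3%82%BA')
# '.item-title' is a hypothetical selector; inspect the page for the real one.
for el in driver.find_elements(By.CSS_SELECTOR, '.item-title'):
    print(el.text)
driver.quit()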

Related

Web scraping using python, how to deal with ngif?

I'm trying to read the price of a fund which is not available through an API. The fund is listed here: https://bors.e24.no/#!/instrument/KL-AFMI2.OSE
At first I thought this would be a simple task, so I looked at BeautifulSoup, but realized that what I wanted was not returned. As far as I can tell, this is due to:
<!-- ngIf: $root.allowStreamingToggle -->
I'm a beginner so hoping someone can help me with an easy way to get this value.
I see JSON being returned from the following endpoint in the network tab:
import requests

headers = {'user-agent': 'Mozilla/5.0'}
r = requests.get('https://bors.e24.no/server/components/graphdata/(PRICE)/DAY/KL-AFMI2.OSE?points=500&stop=2019-07-30&period=1weeks', headers=headers).json()
The price is then:
r['rows'][0]['values']['series']['c1']['data'][3][1]
The tag "ngIf" almost certainly means that the website you are attempting to scrape is an AngularJS app... in which case, the data is almost certainly NOT in the HTML page you are pulling and attempting to parse with BeautifulSoup.
Rather, the page is probably pulling the data later -- say, via AJAX -- and then rendering it INTO the page via Angular's client-side code.
If all that is right... then BeautifulSoup is not the right tool.
You might have some hope if you can identify the AJAX call that the page is calling, then call THAT directly. Inspect it to see the data structure; if you are lucky maybe it is JSON and then super easy to parse. If that looks promising, then you can probably simply use the requests library, and skip BeautifulSoup. But you have to do the reverse engineering to figure out WHAT you should be calling.
Here, try this: I did a little snooping with the browser console. Is this the data you are looking for? get info for KL-AFMI2.OSE
If so, then just use that URL directly in requests.

Efficient web page scraping with Python/Requests/BeautifulSoup

I am trying to grab information from the Chicago Transit Authority bustracker website. In particular, I would like to quickly output the arrival ETAs for the top two buses. I can do this rather easily with Splinter; however, I am running this script on a headless Raspberry Pi model B, and Splinter plus pyvirtualdisplay results in a significant amount of overhead.
Something along the lines of
from bs4 import BeautifulSoup
import requests
url = 'http://www.ctabustracker.com/bustime/eta/eta.jsp?id=15475'
r = requests.get(url)
s = BeautifulSoup(r.text,'html.parser')
does not do the trick. All of the data fields are empty (well, they contain &nbsp;).
For example, the snippet s.find(id='time1').text gives me u'\xa0' instead of the "12 MINUTES" I see when I perform the analogous search with Splinter.
I'm not wedded to BeautifulSoup/requests; I just want something that doesn't require the overhead of Splinter/pyvirtualdisplay, since the project requires that I obtain a short list of strings (e.g. [['9','104th/Vincennes','1158','12 MINUTES'],['9','95th','1300','13 MINUTES']]) and then exits.
The bad news
So the bad news is the page you are trying to scrape is rendered via Javascript. Whilst tools like Splinter, Selenium, PhantomJS can render this for you and give you the output to easily scrape, Python + Requests + BeautifulSoup don't give you this out of the box.
The good news
The data pulled in from the Javascript has to come from somewhere, and usually this will come in an easier to parse format (as it's designed to be read by machines).
In this case your example loads this XML.
Now with an XML response it's not as nice as JSON so I'd recommend reading this answer about integrating with the requests library. But it will be a lot more lightweight than Splinter.
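A minimal sketch of that approach with requests plus the standard-library ElementTree. The endpoint URL and the tag names are assumptions recovered from snooping the network tab; verify both in your browser before relying on them.
import requests
import xml.etree.ElementTree as ET

# Hypothetical endpoint; substitute the XML URL you see in the network tab.
url = 'http://www.ctabustracker.com/bustime/eta/getStopPredictionsETA.jsp?route=all&stop=15475'
r = requests.get(url)
root = ET.fromstring(r.content)

# <pre> elements with <rn>, <fd>, <pt> children are an assumed schema.
for pre in root.iter('pre'):
    print(pre.findtext('rn'), pre.findtext('fd'), pre.findtext('pt'))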

Scraping Ajax using Python

I am trying to get the data in the table at this website, which is updated via jQuery after the page loads (I have permission):
http://whichchart.com/
I currently use Selenium and BeautifulSoup to get data; however, because this data is not visible in the HTML source, I can't access it. I have tried PyQt4, but it likewise does not get the updated HTML source.
The values are visible in Firebug and the Chrome developer tools, so are there any Python packages out there which can exploit this and feed it to BeautifulSoup?
I'm not a massive techie so ideally I would like a solution which would work in Python or the next easiest software type.
I'm aware I can get it via proprietary "screen-scraper" software, but that is expensive.
The page is making an AJAX call to http://whichchart.com/service.php?action=NewcastleCoal, which returns values in JSON. So you can do the following:
Use urllib to fetch the data over HTTP.
Parse that data with the json library's loads method.
Now you have a Python object to process.
If you need to process HTML page content, I would suggest using a library like BeautifulSoup or Scrapy.
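A minimal sketch of those steps, assuming the endpoint still responds with a JSON body (print the result first to learn its structure):
import json
import urllib.request

url = 'http://whichchart.com/service.php?action=NewcastleCoal'
with urllib.request.urlopen(url) as resp:
    data = json.loads(resp.read().decode('utf-8'))
print(data)  # now a plain Python object (dict or list)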

How to crawl a website/extract data into database with python?

I'd like to build a webapp to help other students at my university create their schedules. To do that I need to crawl the master schedules (one huge HTML page) as well as a link to a detailed description for each course into a database, preferably in Python. Also, I need to log in to access the data.
How would that work?
What tools/libraries can/should I use?
Are there good tutorials on that?
How do I best deal with binary data (e.g. pretty PDFs)?
Are there already good solutions for that?
Use requests for downloading the pages.
Here's an example of how to log in to a website and download pages: https://stackoverflow.com/a/8316989/311220
Use lxml for scraping the data.
If you want a powerful scraping framework, there's Scrapy. It has good documentation too, though it may be a little overkill depending on your task.
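A minimal sketch combining the two. The login URL, form field names, and XPath are all placeholders; inspect the university site's actual login form and schedule markup.
import requests
from lxml import html

session = requests.Session()
# Hypothetical login form; check the real action URL and field names.
session.post('https://university.example/login',
             data={'username': 'me', 'password': 'secret'})

r = session.get('https://university.example/master-schedule')
tree = html.fromstring(r.content)
# Placeholder XPath; adapt it to the schedule page's structure.
courses = tree.xpath('//table[@id="schedule"]//tr/td[1]/text()')
print(courses)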
Scrapy is probably the best Python library for crawling. It can maintain state for authenticated sessions.
Dealing with binary data should be handled separately. For each file type, you'll have to handle it differently according to your own logic. For almost any kind of format you'll probably be able to find a library. For instance, take a look at PyPDF for handling PDFs. For Excel files you can try xlrd.
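For instance, a short sketch with pypdf (the maintained successor to PyPDF; the file name is a placeholder):
from pypdf import PdfReader

reader = PdfReader('course-description.pdf')
# extract_text() can return None for image-only pages, hence the fallback.
text = '\n'.join(page.extract_text() or '' for page in reader.pages)
print(text[:500])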
I liked using BeautifulSoup for extracting HTML data.
It's as easy as this:
from bs4 import BeautifulSoup
import urllib.request

# Fetch the RSS feed and pull the enclosure URL out of every <item>.
ur = urllib.request.urlopen("http://pragprog.com/podcasts/feed.rss")
soup = BeautifulSoup(ur.read(), "html.parser")
items = soup.find_all('item')
urls = [item.enclosure['url'] for item in items]
For this purpose there is also a very useful tool called web-harvest. Their website is http://web-harvest.sourceforge.net/. I use it to crawl webpages.

Grabbing non-HTML data from a website using python

I'm trying to get the current contract prices on this page to a string: http://www.cmegroup.com/trading/equity-index/us-index/e-mini-sandp500.html
I would really like a python 2.6 solution.
It was easy to get the page HTML using urllib, but it seems like this number is live and not in the HTML. I inspected the element in Chrome, and it's some td class thing.
But I don't know how to get at this with Python. I tried BeautifulSoup (but after several attempts gave up getting a tar.gz to work on my Windows x64 system), and then ElementTree, but really my programming interest is data analysis. I'm not a website designer and don't really want to become one, so it's all kind of a foreign language. Is this live price XML?
Any assistance gratefully received. Ideally a simple to install module and some actual code, but all hints and tips very welcome.
It looks like the numbers in the table are filled in by JavaScript, so just fetching the HTML with urllib or another library won't be enough, since that doesn't run the JavaScript. You'll need to use a library like PyQt to simulate the browser rendering the page and executing the JS to fill in the numbers, then scrape the resulting HTML.
See this blog post on working with PyQt: http://blog.motane.lu/2009/07/07/downloading-a-pages-content-with-python-and-webkit/
If you look at that website with something like Firebug, you can see the AJAX calls it's making. For instance, the initial values are being filled in with an AJAX call (at least for me) to:
http://www.cmegroup.com/CmeWS/md/MDServer/V1/Venue/G/Exchange/XCME/FOI/FUT/Product/ES?currentTime=1292780678142&contractCDs=,ESH1,ESM1,ESU1,ESZ1,ESH2,ESH1,ESM1,ESU1,ESZ1,ESH2
This returns a JSON response, which is then parsed by JavaScript to fill in the table. It would be pretty simple to do that yourself with urllib and then use simplejson to parse the response.
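A minimal sketch of that idea in modern Python (the standard-library json module covers what simplejson did; on Python 2.6 you would use urllib2 and simplejson instead). The response layout is unknown here, so print it first and then dig out the price:
import json
import urllib.request

url = ('http://www.cmegroup.com/CmeWS/md/MDServer/V1/Venue/G/Exchange/'
       'XCME/FOI/FUT/Product/ES?currentTime=1292780678142&contractCDs='
       ',ESH1,ESM1,ESU1,ESZ1,ESH2,ESH1,ESM1,ESU1,ESZ1,ESH2')
with urllib.request.urlopen(url) as resp:
    data = json.loads(resp.read().decode('utf-8'))
print(data)  # inspect the structure before extracting the quote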
Also, you should read this disclaimer very carefully. What you are trying to do is probably not cool with the owners of the website.
It's hard to know what to tell you without knowing where the number is coming from. It could also be produced server-side by PHP or ASP, so you are going to have to figure out what generates it.
