This question already has answers here:
Web-scraping JavaScript page with Python
(18 answers)
Closed 6 hours ago.
What is the best method to scrape a dynamic website where most of the content is generated by what appears to be ajax requests? I have previous experience with a Mechanize, BeautifulSoup, and python combo, but I am up for something new.
--Edit--
For more detail: I'm trying to scrape the CNN primary database. There is a wealth of information there, but there doesn't appear to be an api.
The best solution that I found was to use Firebug to monitor XmlHttpRequests, and then to use a script to resend them.
This is a difficult problem because you either have to reverse engineer the JavaScript on a per-site basis, or implement a JavaScript engine and run the scripts (which has its own difficulties and pitfalls).
It's a heavy weight solution, but I've seen people doing this with GreaseMonkey scripts - allow Firefox to render everything and run the JavaScript, and then scrape the elements. You can even initiate user actions on the page if needed.
Selenium IDE, a tool for testing, is something I've used for a lot of screen-scraping. There are a few things it doesn't handle well (Javascript window.alert() and popup windows in general), but it does its work on a page by actually triggering the click events and typing into the text boxes. Because the IDE portion runs in Firefox, you don't have to do all of the management of sessions, etc. as Firefox takes care of it. The IDE records and plays tests back.
It also exports C#, PHP, Java, etc. code to build compiled tests/scrapers that are executed on the Selenium server. I've done that for more than a few of my Selenium scripts, which makes things like storing the scraped data in a database much easier.
Scripts are fairly simple to write and alter, being made up of things like ("clickAndWait","submitButton"). Worth a look given what you're describing.
Adam Davis's advice is solid.
I would additionally suggest that you try to "reverse-engineer" what the JavaScript is doing, and instead of trying to scrape the page, you issue the HTTP requests that the JavaScript is issuing and interpret the results yourself (most likely in JSON format, nice and easy to parse). This strategy could be anything from trivial to a total nightmare, depending on the complexity of the JavaScript.
The best possibility, of course, would be to convince the website's maintainers to implement a developer-friendly API. All the cool kids are doing it these days 8-) Of course, they might not want their data scraped in an automated fashion... in which case you can expect a cat-and-mouse game of making their page increasingly difficult to scrape :-(
There is a bit of a learning curve, but tools like Pamie (Python) or Watir (Ruby) will let you latch into the IE web browser and get at the elements. This turns out to be easier than Mechanize and other HTTP level tools since you don't have to emulate the browser, you just ask the browser for the html elements. And it's going to be way easier than reverse engineering the Javascript/Ajax calls. If needed you can also use tools like beatiful soup in conjunction with Pamie.
Probably the easiest way is to use IE webbrowser control in C# (or any other language). You have access to all the stuff inside browser out of the box + you dont need to care about cookies, SSL and so on.
i found the IE Webbrowser control have all kinds of quirks and workarounds that would justify some high quality software to take care of all those inconsistencies, layered around the shvwdoc.dll api and mshtml and provide a framework.
This seems like it's a pretty common problem. I wonder why someone hasn't anyone developed a programmatic browser? I'm envisioning a Firefox you can call from the command line with a URL as an argument and it will load the page, run all of the initial page load JS events and save the resulting file.
I mean Firefox, and other browsers already do this, why can't we simply strip off the UI stuff?
Related
I'd like to ask somebody with experience with headless browsers and python if it's possible to extract box info with distance from closest strike on webpage below. Till now I was using python bs4 but since everything is driven by jQuery here simple download of webpage doesn't work. I found PhantomJS but I wasn't able extract it too so I am not sure if it's possible. Thanks for hints.
https://lxapp.weatherbug.net/v2/lxapp_impl.html?lat=49.13688&lon=16.56522&v=1.2.0
This isn't really a Linux question, it's a StackOverflow question, so I won't go into too much detail.
The thing you want to do can be easily done with Selenium. Selenium has both a headless mode, and a heady mode (where you can watch it open your browser and click on things). The DOM query API is a bit less extensive than bs4, but it does have nice visual query (location on screen) functions. So you would write a Python script that initializes Selenium, goes to your website and interacts with it. You may need to do some image recognition on screenshots at some point. It may be as simple as finding for a certain query image on the screen, or something much more complicated.
You'd have to go through the Selenium tutorials first to see how it works, which would take you 1-2 days. Then figure out what Selenium stuff you can use to do what you want, that depends on luck and whether what you want happens to be easy or hard for that particular website.
Instead of using Selenium, though, I recommend trying to reverse engineer the API. For example, the page you linked to hits https://cmn-lx.pulse.weatherbug.net/data/lightning/v1/spark with parameters like:
_
callback
isGpsLocation
location
locationtype
safetyMessage
shortMessage
units
verbose
authid
timestamp
hash
You can figure out by trial and error which ones you need and what to put in them. You can capture requests from your browser and then read them yourself. Then construct appropriate requests from a Python program and hit their API. It would save you from having to deal with a Web UI designed for humans.
I am trying to extract data from a web site and the data are in a table :
url=requests.get("xxxxx")
soup =BeautifulSoup(url.content)
table=soup.find_all("table")[0]
rows = table.find_all('tr')
I tried this code it works but only 42 lines are extracted and the source table contains 220 lines ?
someone tell me how to fix this.
Welcome.
2 possibilities. Javascript or website security.
requests is javscript agnostic and doesn't execute any javascript code. You'll want a headless browser solution (selenium is popular) that more closely mimicks a browser, especially when it comes to javascript.
Many websites don't want to be scraped and employ different methods to prevent it. The simplest form is checking the User-Agent value of the client (your Python script) or rate-throttling (20k refreshes a second isn't human). e.g., if the User-Agent is anything other than a known value, it'll behave differently (little or no data). Other forms of defense are more complex. Such as trying to play audio on your "browser" or polling your "browser"'s resolution. For that you'll need to investigate the site's behavior. This can take time. You can start off with either the Networking tab of your browser's developing tools (F12 on Firefox) or Zap Proxy for more refined control.
Full disclaimer - I'm not a programmer. I'm trying to get the 12 month rent price (which is currently 1,976) by scraping the following webpage - https://www.essexapartmenthomes.com/apartments/bonita-cedars/floor-plans-and-pricing. My problem is that when I enter the below into my shell terminal, no results are being returned even though I expect some sort of information. I thought this would have been relatively straightforward from the tutorials I've watched, but this website looks to be structured differently (perhaps more complex). I used SelectorGadget to verify the CSS Selector is correct. What am I missing?
scrapy shell "https://www.essexapartmenthomes.com/apartments/bonita-cedars/floor-plans-and-pricing"
response.css('.pricing-list::text').extract()
It's not going to be that easy since the linked page relies heavily on JavaScript. You have two options:
You can use use a rendering engine like splash to render the JavaScript after you load the page and see if you can extract the data
Or you can see what endpoints the site uses to fetch the data which you can fetch yourself manually.
Either way, it's not going to be as trivial as you thought and might be a good idea to consult someone with experience.
I'd like to use Python to scrape the contents of the "Were you looking for these authors:" box on web pages like this one: http://academic.research.microsoft.com/Search?query=lander
Unfortunately the contents of the box get loaded dynamically by JavaScript. Usually in this situation I can read the Javascript to figure out what's going on, or I can use an browser extension like Firebug to figure out where the dynamic content is coming from. No such luck this time...the Javascript is pretty convoluted and Firebug doesn't give many clues about how to get at the content.
Are there any tricks that will make this task easy?
Instead of trying to reverse engineer it, you can use ghost.py to directly interact with JavaScript on the page.
If you run the following query in a chrome console, you'll see it returns everything you want.
document.getElementsByClassName('inline-text-org');
Returns
[<div class="inline-text-org" title="University of Manchester">University of Manchester</div>,
<div class="inline-text-org" title="University of California Irvine">University of California ...</div>
etc...
You can run JavaScript through python in a real life DOM using ghost.py.
This is really cool:
from ghost import Ghost
ghost = Ghost()
page, resources = ghost.open('http://academic.research.microsoft.com/Search?query=lander')
result, resources = ghost.evaluate(
"document.getElementsByClassName('inline-text-org');")
A very similar question was asked earlier here.
Quoted is selenium, originally a testing environment for web-apps.
I usually use Chrome's Developer Mode, which IMHO already gives even more details than Firefox.
For scraping dynamic content, you need not a simple scraper but a full-fledged headless browser.
dhamaniasad/HeadlessBrowsers: A list of (almost) all headless web browsers in existence is the fullest list of these that I've seen; it lists which languages each has bindings for.
(Note that more than a few of the listed projects are abandoned!)
I'm a Python programmer specializing in web-scraping, I had to ask this question as I found nothing relevant.
I want to know what are the popular, well documented frameworks that are available for Python for scraping pure Javascript based sites? Currently I know Mechanize and Beautiful Soup but they do not interact with Javascript so I'm looking for something different. I would prefer something that would be as elegant and simple as mechanize.
I've done a bit of research and so far I've heard about Selenium, Selenium 2 and Windmill.
Right now I'm trying to choose among one these three and I do not know of any others.
So can anyone point out the features of these frameworks and what makes them different? I heard that Selenium uses a separate server to do all it's task and it seems to be feature rich. Also what is the core difference between Selenium and Selenium2? Please enlighten me if I'm wrong, and if you know of any other frameworks do mention it's features and other details.
Thanks.
Before using tools like Selenium that are designed for front end testing and not for scraping, you should have a look at where the data on the site comes from. Find out what XHR requests are made, what parameters they take and what the result is.
For example the site you mentioned in your comment does a POST request with lots of parameters in JavaScript and displays the result. You probably only need to use the result of this POST request to get your data.