Python Scrapy Isn't Extracting Data

Python Scrapy Isn't Extracting Data - python

Full disclaimer - I'm not a programmer. I'm trying to get the 12 month rent price (which is currently 1,976) by scraping the following webpage - https://www.essexapartmenthomes.com/apartments/bonita-cedars/floor-plans-and-pricing. My problem is that when I enter the below into my shell terminal, no results are being returned even though I expect some sort of information. I thought this would have been relatively straightforward from the tutorials I've watched, but this website looks to be structured differently (perhaps more complex). I used SelectorGadget to verify the CSS Selector is correct. What am I missing?
scrapy shell "https://www.essexapartmenthomes.com/apartments/bonita-cedars/floor-plans-and-pricing"
response.css('.pricing-list::text').extract()

It's not going to be that easy since the linked page relies heavily on JavaScript. You have two options:
You can use use a rendering engine like splash to render the JavaScript after you load the page and see if you can extract the data
Or you can see what endpoints the site uses to fetch the data which you can fetch yourself manually.
Either way, it's not going to be as trivial as you thought and might be a good idea to consult someone with experience.

Related

BeautifulSoup find returning "None" [duplicate]

This question already has answers here:
Web-scraping JavaScript page with Python
(18 answers)
Closed 6 hours ago.
What is the best method to scrape a dynamic website where most of the content is generated by what appears to be ajax requests? I have previous experience with a Mechanize, BeautifulSoup, and python combo, but I am up for something new.
--Edit--
For more detail: I'm trying to scrape the CNN primary database. There is a wealth of information there, but there doesn't appear to be an api.

The best solution that I found was to use Firebug to monitor XmlHttpRequests, and then to use a script to resend them.

This is a difficult problem because you either have to reverse engineer the JavaScript on a per-site basis, or implement a JavaScript engine and run the scripts (which has its own difficulties and pitfalls).
It's a heavy weight solution, but I've seen people doing this with GreaseMonkey scripts - allow Firefox to render everything and run the JavaScript, and then scrape the elements. You can even initiate user actions on the page if needed.

Selenium IDE, a tool for testing, is something I've used for a lot of screen-scraping. There are a few things it doesn't handle well (Javascript window.alert() and popup windows in general), but it does its work on a page by actually triggering the click events and typing into the text boxes. Because the IDE portion runs in Firefox, you don't have to do all of the management of sessions, etc. as Firefox takes care of it. The IDE records and plays tests back.
It also exports C#, PHP, Java, etc. code to build compiled tests/scrapers that are executed on the Selenium server. I've done that for more than a few of my Selenium scripts, which makes things like storing the scraped data in a database much easier.
Scripts are fairly simple to write and alter, being made up of things like ("clickAndWait","submitButton"). Worth a look given what you're describing.

Adam Davis's advice is solid.
I would additionally suggest that you try to "reverse-engineer" what the JavaScript is doing, and instead of trying to scrape the page, you issue the HTTP requests that the JavaScript is issuing and interpret the results yourself (most likely in JSON format, nice and easy to parse). This strategy could be anything from trivial to a total nightmare, depending on the complexity of the JavaScript.
The best possibility, of course, would be to convince the website's maintainers to implement a developer-friendly API. All the cool kids are doing it these days 8-) Of course, they might not want their data scraped in an automated fashion... in which case you can expect a cat-and-mouse game of making their page increasingly difficult to scrape :-(

There is a bit of a learning curve, but tools like Pamie (Python) or Watir (Ruby) will let you latch into the IE web browser and get at the elements. This turns out to be easier than Mechanize and other HTTP level tools since you don't have to emulate the browser, you just ask the browser for the html elements. And it's going to be way easier than reverse engineering the Javascript/Ajax calls. If needed you can also use tools like beatiful soup in conjunction with Pamie.

Probably the easiest way is to use IE webbrowser control in C# (or any other language). You have access to all the stuff inside browser out of the box + you dont need to care about cookies, SSL and so on.

i found the IE Webbrowser control have all kinds of quirks and workarounds that would justify some high quality software to take care of all those inconsistencies, layered around the shvwdoc.dll api and mshtml and provide a framework.

This seems like it's a pretty common problem. I wonder why someone hasn't anyone developed a programmatic browser? I'm envisioning a Firefox you can call from the command line with a URL as an argument and it will load the page, run all of the initial page load JS events and save the resulting file.
I mean Firefox, and other browsers already do this, why can't we simply strip off the UI stuff?

I am trying to scrape this website for all of the documents that are produced from the drop down forms

The site I am trying to scrap has drop-down menus that end up producing a link to a document. The end documents are what I want. I have no experience with web scraping so I don't know where to start on this. I don't know where to start. I have tried adapting this to my needs, but I couldn't get it working. I also tried to adapt this.
I know basically I need to:
for state in states:
select state
for type in types:
select type
select wage_area_radio button
for area in wage_area:
select area
for locality in localities:
select locality
for date in dates:
select date
get_document
I just haven't found anything that works for me yet. Is there a tool better than Selenium for this? I am currently trying to bend it to my will using the the code from my second example as a starter.

Depending on your coding skills and knowledge of HTTP, I would try one of two things. Note that scraping this site appears slightly non-trivial because of the different form options that appear based on what was previously selected, and the fact that there's a lot of AJAX calls happening.
1) Follow the HTTP requests (especially the AJAX ones) that are being made in something like Chrome DevTools. You'll get a good understanding of how the final URL is being formed and how to construct it yourself. In particular, it looks like the last POST to AFWageScheduleYearSelected is the one that generates the final url. Then, you can make these calls yourself in a Python HTTP library to get the documents.
2) Use something like PhantomJS (http://phantomjs.org/) which is a headless browser. I don't have experience scraping with Selenium, but my understanding is that it is more of a testing/automation tool. In any case, PhantomJS is pretty easy to get up and running and you can basically click page elements, fill out forms, etc.
If you do end up using PhantomJS (or any other browser-like tool), you'll run into issues with the AJAX calls that populate the forms. Basically, you'll end up trying to fill out forms that don't yet exist on the page because the data is still being sent over the network. The easiest way to get around this is to just set timeouts (of say 2 seconds) in between each form field that you fill out. The alternative to using timeouts (which may be unreliable and slow) is to continuously poll the page until the AJAX call is finished.

Read all pages within a domain

I am using the urllib library to fetch pages. Typically I have the top-level domain name & I wish to extract some information from EVERY page within that domain. Thus, if I have xyz.com, I'd like my code to fetch the data from xyz.com/about etc. Here's what I am using:
import urllib,re
htmlFile = urllib.urlopen("http://www.xyz.com/"+r"(.*)")
html = htmlFile.read()
...............
This doe not do the trick for me though. Any ideas are appreciated.
Thanks.
-T

I don't know why you would expect domain.com/(.*) to work. You need to have a list of all the pages (dynamic or static) within that domain. Your python program cannot automatically know that. This knowledge you must obtain from elsewhere, either by following links or looking at the sitemap of the website.
As a footnote, scraping is a slightly shady business. Always make sure, no matter what method you employ, that you are not violating any terms and conditions.

You are trying to use a regular expression on the web server. Turns out, web servers don't actually support this kind of format, so it's failing.
To do what you're trying to, you need to implement a spider. A program that will download a page, find all the links within it, and decide which of them to follow. Then, downloads each of those pages, and repeats.
Some things to watch out for - looping, multiple links that end up pointing at the same page, links going outside of the domain, and getting banned from the webserver for spamming it with 1000s of requests.

In addition to #zigdon answer I recommend you to take a look at scrapy framework.
CrawlSpider will help you to implement crawling quite easily.

Scrapy has this functionality built in. No recursively getting links. It asynchronously automatically handles all the heavy lifting for you. Just specify your domain and search terms and how deep you want it to search in the page .ie the whole site.
http://doc.scrapy.org/en/latest/index.html

Choosing a Python webscraping framework for handling pure Javascript based sites

I'm a Python programmer specializing in web-scraping, I had to ask this question as I found nothing relevant.
I want to know what are the popular, well documented frameworks that are available for Python for scraping pure Javascript based sites? Currently I know Mechanize and Beautiful Soup but they do not interact with Javascript so I'm looking for something different. I would prefer something that would be as elegant and simple as mechanize.
I've done a bit of research and so far I've heard about Selenium, Selenium 2 and Windmill.
Right now I'm trying to choose among one these three and I do not know of any others.
So can anyone point out the features of these frameworks and what makes them different? I heard that Selenium uses a separate server to do all it's task and it seems to be feature rich. Also what is the core difference between Selenium and Selenium2? Please enlighten me if I'm wrong, and if you know of any other frameworks do mention it's features and other details.
Thanks.

Before using tools like Selenium that are designed for front end testing and not for scraping, you should have a look at where the data on the site comes from. Find out what XHR requests are made, what parameters they take and what the result is.
For example the site you mentioned in your comment does a POST request with lots of parameters in JavaScript and displays the result. You probably only need to use the result of this POST request to get your data.

Grabbing non-HTML data from a website using python

I'm trying to get the current contract prices on this page to a string: http://www.cmegroup.com/trading/equity-index/us-index/e-mini-sandp500.html
I would really like a python 2.6 solution.
It was easy to get the page html using urllib, but it seems like this number is live and not in the html. I inspected the element in Chrome and it's some td class thing.
But I don't know how to get at this with python. I tried beautifulsoup (but after several attempts gave up getting a tar.gz to work on my windows x64 system), and then elementtree, but really my programming interest is data analysis. I'm not a website designer and don't really want to become one, so it's all kind of a foreign language. Is this live price XML?
Any assistance gratefully received. Ideally a simple to install module and some actual code, but all hints and tips very welcome.

It looks like the numbers in the table are filled in by Javascript, so just fetching the HTML with urllib or another library won't be enough since they don't run the javascript. You'll need to use a library like PyQt to simulate the browser rendering the page/executing the JS to fill in the numbers, then scrape the output HTML of that.
See this blog post on working with PyQt: http://blog.motane.lu/2009/07/07/downloading-a-pages-content-with-python-and-webkit/link text

If you look at that website with something like firebug, you can see the AJAX calls it's making. For instance the initial values are being filled in with a AJAX call (at least for me) to:
http://www.cmegroup.com/CmeWS/md/MDServer/V1/Venue/G/Exchange/XCME/FOI/FUT/Product/ES?currentTime=1292780678142&contractCDs=,ESH1,ESM1,ESU1,ESZ1,ESH2,ESH1,ESM1,ESU1,ESZ1,ESH2
This is returning a JSON response, which is then parsed by javascript to fill in the tabel. It would be pretty simple to do that yourself with urllib and then use simplejson to parse the response.
Also, you should read this disclaimer very carefully. What you are trying to do is probably not cool with the owners of the web-site.

Its hard to know what to tell you wothout knowing where the number is coming from. It could be php or asp also, so you are going to have to figure out which language the number is in.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Python Scrapy Isn't Extracting Data - python

Related

BeautifulSoup find returning "None" [duplicate]

I am trying to scrape this website for all of the documents that are produced from the drop down forms

Read all pages within a domain

Choosing a Python webscraping framework for handling pure Javascript based sites

Grabbing non-HTML data from a website using python

Categories

Resources