I'm trying to scrape a website that provides individual access to court cases in New Jersey county courts. I'm having a lot of trouble figuring out how to start though. I've scraped quite a few websites before but I've usually been able to start by adapting the URL to pass through the search parameters. However, when I access this data the URL does not change so I'm at a bit of a loss.
Additionally, there is a check to prove that I am not a robot (which occasionally turns into a reCAPTCHA).
On the website linked above, say, for example, the inputs would be:
Case County==Bergen, Docket Type==Landlord Tenant (LT), Docket Number==000001, and Docket Year==19.
I would then like to be able to extract the Defendant Name or anything from the subsequent page.
Does anyone have any advice on how I should proceed with this?
Thanks in advance
Websites which "require input" can be scraped using Selenium, which evaluates the JavaScript: your Python code then drives the page more as a "user" would (click here, type there). It's slow.
Alternatively, if you look at the page details, you may see what happens to the input, and simply execute the resulting GET or POST URL, properly formed. For example, forms will often do a POST with the parameters: look at the code, figure out what parameters get posted and to what URL, and then execute that POST from Python -- you'll probably need a cookie jar to maintain session info.
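A rough sketch of that idea, assuming the requests and beautifulsoup4 packages. The URL, form-field names, and the final parsing step are all hypothetical -- inspect the real search form in your browser's developer tools and substitute whatever you actually see being posted.

import requests
from bs4 import BeautifulSoup

session = requests.Session()  # keeps cookies / session info across requests

# Load the search page first so the server sets its session cookies (and so you
# can scrape any hidden form fields or tokens the form requires).
search_url = "https://portal.njcourts.gov/some/search/page"  # hypothetical URL
session.get(search_url)

payload = {                     # hypothetical field names -- copy the real ones
    "county": "BERGEN",
    "docketType": "LT",
    "docketNumber": "000001",
    "docketYear": "19",
}
result = session.post(search_url, data=payload)

# Dump the text of the result page first to see where the defendant name
# actually appears, then write a precise selector for it.
soup = BeautifulSoup(result.text, "html.parser")
print(soup.get_text())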
HOWEVER: as a website maintainer, my advice to you is not to attempt to scrape this site: it doesn't want to be scraped, and repeated attempts only escalate defensive measures on the part of the website owner. You may also be violating the site's usage policy and state and/or federal laws.
Instead, look for an alternative API or data source. (NJ Courts may have an API designed for programmatic use: send them an email!)
Full disclaimer - I'm not a programmer. I'm trying to get the 12 month rent price (which is currently 1,976) by scraping the following webpage - https://www.essexapartmenthomes.com/apartments/bonita-cedars/floor-plans-and-pricing. My problem is that when I enter the below into my shell terminal, no results are being returned even though I expect some sort of information. I thought this would have been relatively straightforward from the tutorials I've watched, but this website looks to be structured differently (perhaps more complex). I used SelectorGadget to verify the CSS Selector is correct. What am I missing?
scrapy shell "https://www.essexapartmenthomes.com/apartments/bonita-cedars/floor-plans-and-pricing"
response.css('.pricing-list::text').extract()
It's not going to be that easy since the linked page relies heavily on JavaScript. You have two options:
You can use a rendering engine like Splash to render the JavaScript after the page loads and see if you can extract the data.
Or you can see what endpoints the site uses to fetch the data and call them yourself manually (a rough sketch follows below).
Either way, it's not going to be as trivial as you thought, and it might be a good idea to consult someone with experience.
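To illustrate the second option, here is a minimal sketch assuming the requests package. The endpoint URL, query parameters, and JSON keys below are hypothetical -- open the browser's Network tab on the floor-plans page and substitute whatever XHR request actually returns the pricing data.

import requests

ENDPOINT = "https://www.essexapartmenthomes.com/api/floorplans"  # hypothetical endpoint
params = {"community": "bonita-cedars"}                          # hypothetical parameters

data = requests.get(ENDPOINT, params=params, timeout=10).json()
for plan in data.get("floorplans", []):                          # hypothetical JSON keys
    print(plan.get("name"), plan.get("price12Month"))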
The site I am trying to scrape has drop-down menus that end up producing a link to a document. The end documents are what I want. I have no experience with web scraping, so I don't know where to start. I have tried adapting this to my needs, but I couldn't get it working. I also tried to adapt this.
I know basically I need to:
for state in states:
    select state
    for type in types:
        select type
        select wage_area_radio button
        for area in wage_area:
            select area
            for locality in localities:
                select locality
                for date in dates:
                    select date
                    get_document
I just haven't found anything that works for me yet. Is there a tool better than Selenium for this? I am currently trying to bend it to my will using the code from my second example as a starter.
Depending on your coding skills and knowledge of HTTP, I would try one of two things. Note that scraping this site appears slightly non-trivial because of the different form options that appear based on what was previously selected, and because there are a lot of AJAX calls happening.
1) Follow the HTTP requests (especially the AJAX ones) that are being made in something like Chrome DevTools. You'll get a good understanding of how the final URL is being formed and how to construct it yourself. In particular, it looks like the last POST to AFWageScheduleYearSelected is the one that generates the final URL. Then you can make these calls yourself from a Python HTTP library to get the documents.
2) Use something like PhantomJS (http://phantomjs.org/) which is a headless browser. I don't have experience scraping with Selenium, but my understanding is that it is more of a testing/automation tool. In any case, PhantomJS is pretty easy to get up and running and you can basically click page elements, fill out forms, etc.
If you do end up using PhantomJS (or any other browser-like tool), you'll run into issues with the AJAX calls that populate the forms. Basically, you'll end up trying to fill out forms that don't yet exist on the page because the data is still being sent over the network. The easiest way to get around this is to just set timeouts (of say 2 seconds) in between each form field that you fill out. The alternative to using timeouts (which may be unreliable and slow) is to continuously poll the page until the AJAX call is finished.
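As an illustration of the polling alternative, here is a minimal sketch using Selenium's explicit waits rather than fixed timeouts. The URL and element IDs are hypothetical -- substitute the real ones from the site's HTML -- and the same select-then-wait pattern would be repeated for each dependent dropdown.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait, Select

driver = webdriver.Firefox()  # or any other (headless) browser driver
driver.get("https://example.gov/wage-determinations")  # hypothetical URL
wait = WebDriverWait(driver, 15)

# Select a state, then wait until the AJAX call has populated the dependent
# "type" dropdown before touching it.
Select(driver.find_element(By.ID, "state")).select_by_visible_text("Alaska")
wait.until(lambda d: len(Select(d.find_element(By.ID, "type")).options) > 1)
Select(driver.find_element(By.ID, "type")).select_by_index(1)

# ... repeat the same pattern for the wage area, locality, and date, then
# grab the link to the final document ...

driver.quit()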
I am trying to get the names of first 200 Facebook users.
I am using Python and BeautifulSoup
The approach I'm using is that, instead of using the Graph API, I'm trying to get the names from the title of the profile webpage (the title of the profile page is the person's name).
The first user is Zuckerberg (id: 4). I want the names up to id 200.
Here's what I've tried.
import urllib2
from BeautifulSoup import BeautifulSoup
x = 4
while x <= 200:
    print BeautifulSoup(urllib2.urlopen("https://www.facebook.com/" + str(x))).title.string
    x += 1
Can anyone help?
Well, I concur with the other commenters that there is barely enough information to figure out what the problem is. But reading between the lines a bit, I imagine the OP is expecting that results for pages such as
https://www.facebook.com/4
which gets redirected to https://www.facebook.com/zuck, Mark Zuckerberg's page, and https://www.facebook.com/5
which gets redirected to https://www.facebook.com/ChrisHughes, another early Facebook employee, will continue to work for whatever arbitrary user IDs the OP plugs in. In fact, I believe this trick did work in the past... until someone posted a spreadsheet of the first 2000 Facebook users somewhere and Facebook clamped down on the hole (this is from memory; I bet there's a news story out there if someone feels like digging).
Anyway, trying further user IDs in the URL such as:
https://www.facebook.com/7 now gives a "Sorry, this page isn't available" response. To the OP, I don't think there's any easy way you can code around this -- Zuck obviously doesn't care that you're harvesting his own page, but I guess he's not keen on letting you scrape the entire Facebook user list. Sorry.
Update: you might still be able to get away with such harvesting using Facebook's Graph API -- it appears that GETs of pages like https://graph.facebook.com/100 will work for most User IDs. You should be able to script up what you need from there (if I were Facebook, I would have rate-limiting in place to prevent mass harvesting, but you'll have to try and see what you get for yourself.) Here's a script similar to what you're trying to do.
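A minimal sketch of that Graph API approach, assuming the requests package. Depending on Facebook's current rules, these calls may require an access_token parameter and will likely be rate-limited, so treat this purely as an illustration.

import requests

for user_id in range(4, 201):
    # May need params={"access_token": "..."} depending on current API rules.
    resp = requests.get("https://graph.facebook.com/%d" % user_id)
    if resp.ok:
        print(user_id, resp.json().get("name"))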
I'm new to scraping. I've written a scraper that scrapes the Maplin store. I used Python as the language and BeautifulSoup to scrape the store.
I want to ask: if I need to scrape some other eCommerce store (say Amazon or Flipkart), do I need to customize my code, since they have a different HTML schema (the id and class names are different, plus other things as well)? So the scraper I wrote will not work for another eCommerce store.
I want to know how price-comparison sites scrape data from all the online stores. Do they have different code for each online store, or is there a generic one? Do they study the HTML schema of every online store?
do I need to customize my code
Yes, sure. It is not only because the websites have different HTML schemas. It is also about the mechanisms involved in loading and rendering the page: some sites use AJAX to load partial content, others let JavaScript fill in placeholders on the page, which makes it harder to scrape - there can be lots and lots of differences. Still others use anti-scraping techniques: checking your headers and behavior, banning you after hitting the site too often, etc.
I've also seen cases where prices were kept as images, or obfuscated with "noise" - different tags nested inside one another and hidden using various techniques (CSS rules, classes, JS code, "display: none", etc.) - so that for an end user in a browser the data looked normal, but for a web-scraping "robot" it was a mess.
want to know how price-comparison sites scrape data from all the online stores?
Usually, they use APIs whenever possible. But, if not, web-scraping and HTML parsing is always an option.
The general high-level idea is to split the scraping code into two main parts. The static part is a generic web-scraping spider (the logic) that reads whatever parameters or configuration are passed in. The dynamic part - an annotator or web-site-specific configuration - is usually a set of field-specific XPath expressions or CSS selectors.
See, as an example, Autoscraping tool provided by Scrapinghub:
Autoscraping is a tool to scrape web sites without any programming
knowledge. You just annotate web pages visually (with a point and
click tool) to indicate where each field is on the page and
Autoscraping will scrape any similar page from the site.
And, FYI, study what Scrapinghub offers and documents - there is a lot of useful information and a set of different unique web-scraping tools.
I've personally been involved in a project where we were building a generic Scrapy spider. As far as I remember, we had a "target" database table where records were inserted by a browser extension (the annotator); the field annotations were kept in JSON:
{
    "price": "//div[@class='price']/text()",
    "description": "//div[@class='title']/span[2]/text()"
}
The generic spider received a target id as a parameter, read the configuration, and crawled the web-site.
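A minimal sketch of that kind of generic spider, assuming Scrapy. Here the configuration is a hard-coded dict of XPaths for illustration; in the real project it was loaded from the database by target id.

import scrapy


class GenericSpider(scrapy.Spider):
    name = "generic"

    def __init__(self, start_url=None, field_xpaths=None, *args, **kwargs):
        super(GenericSpider, self).__init__(*args, **kwargs)
        self.start_urls = [start_url] if start_url else []
        # Field name -> XPath expression, e.g. loaded from the annotator's JSON.
        self.field_xpaths = field_xpaths or {
            "price": "//div[@class='price']/text()",
            "description": "//div[@class='title']/span[2]/text()",
        }

    def parse(self, response):
        item = {}
        for field, xpath in self.field_xpaths.items():
            item[field] = response.xpath(xpath).get()
        yield item

You could run it with something like scrapy runspider generic_spider.py -a start_url='http://example.com/some-product'.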
We had a lot of problems staying on the generic side. Once a web-site involved JavaScript and AJAX, we started writing site-specific logic to get to the desired data.
See also:
Creating a generic scrapy spider
Using one Scrapy spider for several websites
What is the best practice for writing maintainable web scrapers?
Many price-comparison scrapers do the product search on the vendor site only when a user indicates that they wish to track the price of something. Once the user selects what they are interested in, it is added to a global cache of products that can then be scraped periodically, rather than having to trawl the whole site on a frequent basis.
I am using the urllib library to fetch pages. Typically I have the top-level domain name & I wish to extract some information from EVERY page within that domain. Thus, if I have xyz.com, I'd like my code to fetch the data from xyz.com/about etc. Here's what I am using:
import urllib,re
htmlFile = urllib.urlopen("http://www.xyz.com/"+r"(.*)")
html = htmlFile.read()
...............
This does not do the trick for me though. Any ideas are appreciated.
Thanks.
-T
I don't know why you would expect domain.com/(.*) to work. You need a list of all the pages (dynamic or static) within that domain. Your Python program cannot know that automatically; you must obtain that knowledge elsewhere, either by following links or by looking at the website's sitemap.
As a footnote, scraping is a slightly shady business. Always make sure, no matter what method you employ, that you are not violating any terms and conditions.
You are trying to send a regular expression to the web server. Web servers don't interpret URLs as patterns like that, so the request fails.
To do what you're trying to do, you need to implement a spider: a program that downloads a page, finds all the links within it, decides which of them to follow, then downloads each of those pages and repeats.
Some things to watch out for - looping, multiple links that end up pointing at the same page, links going outside of the domain, and getting banned from the webserver for spamming it with 1000s of requests.
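A minimal same-domain crawler sketch along those lines, assuming the requests and beautifulsoup4 packages; the starting URL is a placeholder, and a polite crawler would also add a delay between requests and respect robots.txt.

from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

START = "http://www.xyz.com/"  # placeholder starting point
DOMAIN = urlparse(START).netloc

visited = set()
queue = [START]

while queue:
    url = queue.pop()
    if url in visited:
        continue  # avoid loops and duplicate pages
    visited.add(url)
    try:
        html = requests.get(url, timeout=10).text
    except requests.RequestException:
        continue
    # ... extract whatever data you need from `html` here ...
    for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
        link = urljoin(url, a["href"]).split("#")[0]
        if urlparse(link).netloc == DOMAIN and link not in visited:
            queue.append(link)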
In addition to @zigdon's answer, I recommend you take a look at the Scrapy framework.
CrawlSpider will help you implement the crawling quite easily.
Scrapy has this functionality built in: there's no need to recursively fetch links yourself. It handles all the heavy lifting for you, asynchronously and automatically. Just specify your domain and search terms, and how deep you want it to search, e.g. the whole site.
http://doc.scrapy.org/en/latest/index.html
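For reference, a minimal CrawlSpider sketch assuming Scrapy is installed; the domain and the per-page parsing are placeholders.

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class SiteSpider(CrawlSpider):
    name = "site"
    allowed_domains = ["xyz.com"]  # keeps the crawl inside the domain
    start_urls = ["http://www.xyz.com/"]

    # Follow every in-domain link and run parse_page on each page found.
    rules = (
        Rule(LinkExtractor(), callback="parse_page", follow=True),
    )

    def parse_page(self, response):
        # Extract whatever you need from each page, e.g. its title.
        yield {
            "url": response.url,
            "title": response.xpath("//title/text()").get(),
        }

Run it with something like scrapy runspider site_spider.py -o pages.json to dump the results.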