I'm new to scraping. I've written a scraper for the Maplin store, using Python as the language and BeautifulSoup to parse the pages.
I want to ask: if I need to scrape some other eCommerce store (say Amazon or Flipkart), do I need to customize my code, since they have a different HTML schema (id and class names are different, plus other things as well)? As it stands, the scraper I wrote will not work for any other eCommerce store.
I want to know how price-comparison sites scrape data from all the online stores. Do they have different code for each online store, or is there a generic one? Do they study the HTML schema of every online store?
do I need to customize my code
Yes, sure. It is not only because websites have different HTML schemas. It is also about the mechanisms involved in loading and rendering the page: some sites use AJAX to load part of a page's content, others let JavaScript fill in placeholders on the page, which makes them harder to scrape. There can be lots and lots of differences. Others use anti-scraping techniques: checking your headers and behavior, banning you after you hit the site too often, etc.
I've also seen cases where prices were kept as images, or obfuscated with "noise": different tags nested inside one another and hidden using different techniques, such as CSS rules, classes, JS code, "display: none", etc. For an end user in a browser the data looked normal, but for a web-scraping "robot" it was a mess.
want to know how price-comparison sites scrape data from all the online stores?
Usually, they use APIs whenever possible. When no API is available, web scraping and HTML parsing is always an option.
The general high-level idea is to split the scraping code into two main parts. The static part is a generic web-scraping spider (the logic) that reads the parameters or configuration passed in. The dynamic part is the site-specific configuration (the annotations), usually field-specific XPath expressions or CSS selectors.
See, as an example, Autoscraping tool provided by Scrapinghub:
Autoscraping is a tool to scrape web sites without any programming
knowledge. You just annotate web pages visually (with a point and
click tool) to indicate where each field is on the page and
Autoscraping will scrape any similar page from the site.
And, FYI, study what Scrapinghub offers and documents - there is a lot of useful information and a set of different unique web-scraping tools.
I've personally been involved in a project where we were building a generic Scrapy spider. As far as I remember, we had a "target" database table where records were inserted by a browser extension (the annotator), and the field annotations were kept in JSON:
{
"price": "//div[@class='price']/text()",
"description": "//div[@class='title']/span[2]/text()"
}
The generic spider received a target id as a parameter, read the configuration, and crawled the web-site.
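A minimal sketch of that split, assuming a well-formed page and using only the standard library. The page, the function name extract_fields and the configuration below are made up for illustration. ElementTree supports only a subset of XPath (there is no text() step), so the expressions here select elements and the code reads their text; a real spider would more likely evaluate full XPath with Scrapy or lxml.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical per-site configuration, in the same spirit as the JSON above.
# The paths are adapted to ElementTree's limited XPath subset.
CONFIG_JSON = """
{
    "price": ".//div[@class='price']",
    "description": ".//div[@class='title']/span[2]"
}
"""

# Made-up, well-formed page used only for illustration.
PAGE = """
<html><body>
  <div class="title"><span>Brand</span><span>Blue widget</span></div>
  <div class="price">9.99</div>
</body></html>
"""


def extract_fields(page_source, field_paths):
    """Generic part: apply every configured path to the page."""
    root = ET.fromstring(page_source.strip())
    item = {}
    for field, path in field_paths.items():
        node = root.find(path)
        # Missing elements or empty text become None instead of raising.
        item[field] = node.text.strip() if node is not None and node.text else None
    return item


item = extract_fields(PAGE, json.loads(CONFIG_JSON))
print(item)
```

Only CONFIG_JSON changes from site to site in this sketch; extract_fields stays the same.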
We had a lot of problems keeping the spider generic. Once a website involved JavaScript and AJAX, we had to start writing site-specific logic to get to the desired data.
See also:
Creating a generic scrapy spider
Using one Scrapy spider for several websites
What is the best practice for writing maintainable web scrapers?
Many price-comparison scrapers run a product search on the vendor's site when a user indicates they want to track the price of something. Once the user selects the product they are interested in, it is added to a global cache of products that can then be scraped periodically, rather than having to trawl the whole site on a frequent basis.
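The tracked-product cache could be sketched roughly as follows. This is an in-memory stand-in with hypothetical names (TrackedProducts, track, due, mark_scraped) and example URLs; a real system would presumably keep this in a database and run the re-scrape from a scheduler.

```python
from datetime import datetime, timedelta


class TrackedProducts:
    """In-memory stand-in for the shared cache of products users asked to track."""

    def __init__(self, interval):
        self.interval = interval
        self._last_scraped = {}  # product URL -> datetime of last scrape, or None

    def track(self, url):
        # Tracking an already known product keeps its existing timestamp.
        self._last_scraped.setdefault(url, None)

    def due(self, now):
        """Return the URLs never scraped or last scraped at least `interval` ago."""
        return sorted(
            url
            for url, last in self._last_scraped.items()
            if last is None or now - last >= self.interval
        )

    def mark_scraped(self, url, when):
        self._last_scraped[url] = when


# Hypothetical product URLs and re-scrape interval.
cache = TrackedProducts(interval=timedelta(hours=6))
cache.track("https://shop.example/item/1")
cache.track("https://shop.example/item/2")

start = datetime(2024, 1, 1, 12, 0)
cache.mark_scraped("https://shop.example/item/1", start)

due_now = cache.due(start + timedelta(hours=1))
due_later = cache.due(start + timedelta(hours=7))
print(due_now, due_later)
```

Only the products in the cache are revisited, so the cost grows with what users actually track rather than with the size of the vendor's catalogue.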
Related
I am looking for a Python module that will let me navigate the search bars, links, etc. of a website.
For context, I am looking to do a little web scraping of this website [https://www.realclearpolitics.com/]
I simply want to take information on each state (polling data etc.) in relation to the 2020 election and organize it all in a collection or a database.
Obviously there are a lot of states to go through, and each is on a separate webpage. So I'm looking for a way in Python to quickly navigate the site and take the data from each page, as well as update and add to existing data. Finding a method of quickly navigating links and search bars with my own input would be very helpful.
Any suggestions would be greatly appreciated.
# a simple list that contains the names of each state
states = ["Alabama", "Alaska", "Arizona", "....."]
for state in states:
    # code to look up the state in the search bar of the website
    # figures being taken from the website, etc.
    break
Here is the rough idea I have.
There are many options to accomplish this with Python. As @LD mentioned, you can use Selenium. Selenium is a good option if you need to interact with a website's UI through a browser (which can run headless), e.g. clicking a button or entering text into a search bar. If your needs aren't that complex, for instance if you just need to quickly fetch the raw content of a web page and process it, then you can use the requests library (a third-party package, not part of Python's standard library).
For processing the raw content from a crawl, I would recommend Beautiful Soup.
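As a rough illustration of the "process the raw content" step, here is a sketch that uses the standard library's html.parser as a stand-in for Beautiful Soup. The HTML sample and the CellCollector class are made up; on a real page the markup would come from requests (or urllib) and would almost certainly be structured differently.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collects the text of every <td> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td":
            if not self.rows:
                # Tolerate a cell that appears outside any row.
                self.rows.append([])
            self._in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data.strip()


# Invented sample markup, not taken from any real site.
SAMPLE = """
<table>
  <tr><td>Pollster A</td><td>48</td><td>46</td></tr>
  <tr><td>Pollster B</td><td>45</td><td>47</td></tr>
</table>
"""

parser = CellCollector()
parser.feed(SAMPLE)
rows = parser.rows
print(rows)
```

With Beautiful Soup the same extraction is usually shorter, since it builds a tree you can query instead of requiring event handlers.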
Hope that helps!
Full disclaimer - I'm not a programmer. I'm trying to get the 12 month rent price (which is currently 1,976) by scraping the following webpage - https://www.essexapartmenthomes.com/apartments/bonita-cedars/floor-plans-and-pricing. My problem is that when I enter the below into my shell terminal, no results are being returned even though I expect some sort of information. I thought this would have been relatively straightforward from the tutorials I've watched, but this website looks to be structured differently (perhaps more complex). I used SelectorGadget to verify the CSS Selector is correct. What am I missing?
scrapy shell "https://www.essexapartmenthomes.com/apartments/bonita-cedars/floor-plans-and-pricing"
response.css('.pricing-list::text').extract()
It's not going to be that easy since the linked page relies heavily on JavaScript. You have two options:
You can use a rendering engine like Splash to render the JavaScript after you load the page and see if you can extract the data
Or you can see what endpoints the site uses to fetch the data which you can fetch yourself manually.
Either way, it's not going to be as trivial as you thought and might be a good idea to consult someone with experience.
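The second option might look roughly like the sketch below. The endpoint URL, the payload shape and the field names (pricing, term_months, rent) are all assumptions made for illustration; the real ones have to be read from the browser's network tab and will differ. The sample payload is invented apart from the 1,976 figure mentioned in the question, and the network call is defined but not executed.

```python
import json
from urllib.request import Request, urlopen

# Hypothetical endpoint; find the real one in the browser's developer tools.
ENDPOINT = "https://example.com/api/floor-plans"


def fetch_json(url):
    """Fetch a JSON document from the endpoint the page itself calls."""
    request = Request(url, headers={"Accept": "application/json"})
    with urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


def rent_for_term(payload, months):
    """Return the rent for a lease term, or None if the term is not listed."""
    for option in payload.get("pricing", []):
        if option.get("term_months") == months:
            return option.get("rent")
    return None


# Invented payload in the assumed shape, so the parsing can be tried offline.
SAMPLE_PAYLOAD = {
    "pricing": [
        {"term_months": 6, "rent": 2100},
        {"term_months": 12, "rent": 1976},
    ]
}

rent = rent_for_term(SAMPLE_PAYLOAD, 12)
print(rent)
```

In real use, rent_for_term(fetch_json(ENDPOINT), 12) would replace the sample payload once the actual endpoint and field names are known.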
I'm trying to scrape a website that provides individual access to court cases in New Jersey county courts. I'm having a lot of trouble figuring out how to start though. I've scraped quite a few websites before but I've usually been able to start by adapting the URL to pass through the search parameters. However, when I access this data the URL does not change so I'm at a bit of a loss.
Additionally, there is a test for me to prove that I am not a Robot (which occasionally turns into a ReCaptcha).
On the website linked above, say, for example, the inputs would be:
Case County==Bergen, Docket Type==Landlord Tenant (LT), Docket Number==000001, and Docket Year==19.
I would then like to be able to extract the Defendant Name or anything from the subsequent page.
Does anyone have any advice on how I should proceed with this?
Thanks in advance
Websites which "require input" can be scraped using Selenium, which drives a real browser and so evaluates the JavaScript: your Python code then operates the page more like a "user" (click here, type there). It's slow.
Alternatively, if you look at the page details, you may see what happens to the input, and simply execute the resulting GET or POST request, properly formed. For example, forms will often do a POST with the parameters: look at the code, figure out which parameters get posted and to what URL, and then execute that POST from Python. You'll probably need a cookie jar to maintain session info.
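A sketch of the POST approach with the standard library, using a hypothetical URL and hypothetical field names; the values are the example inputs from the question. The real URL and parameter names come from the form's markup or from the request seen in the browser's developer tools. The request is built but deliberately not sent here.

```python
from http.cookiejar import CookieJar
from urllib.parse import urlencode
from urllib.request import HTTPCookieProcessor, Request, build_opener

# Hypothetical URL and field names, for illustration only.
SEARCH_URL = "https://example.com/search"
FORM_FIELDS = {
    "county": "Bergen",
    "docket_type": "LT",
    "docket_number": "000001",
    "docket_year": "19",
}


def build_search_request(url, fields):
    """Build the same kind of POST a browser sends when the form is submitted."""
    body = urlencode(fields).encode("ascii")
    return Request(
        url,
        data=body,  # supplying data makes urllib issue a POST
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


# An opener with a cookie jar keeps session cookies between requests.
cookie_jar = CookieJar()
opener = build_opener(HTTPCookieProcessor(cookie_jar))

request = build_search_request(SEARCH_URL, FORM_FIELDS)
print(request.get_method(), request.data)
# response = opener.open(request, timeout=10)  # not executed in this sketch
```

Loading the form page through the same opener first lets the server set its session cookie before the POST is sent.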
However, as a website maintainer, my advice to you is to not attempt to scrape this site: it doesn't want to be scraped, and repeated attempts only escalate defensive measures on the part of the website owner. You may also be violating the usage policy, or state and/or federal laws.
Instead, look for an alternative API, or alternative source. (NJ Courts may have an alternative API, designed for computer usage: send them an email!)
I have a list of approx. 52 websites which lead to approx. 150 webpages that I need to scrape. Out of ignorance and lack of research, I started building a crawler per webpage, which is becoming too difficult to complete and maintain.
Based on my analysis so far, I already know what information I want to scrape from each webpage, and it is clear that these websites have their own structure. On the plus side, I noticed that each website has some commonalities in structure among its webpages.
My million-dollar question: is there a single technique or a single web crawler that I can use to scrape these sites? I already know the information that I want, these sites rarely change their web structure, and most of them have documents that need to be downloaded.
Alternatively, is there a better solution that will reduce the number of web crawlers I need to build? Additionally, these web crawlers will only be used to download new information from the websites I am aiming them at.
[…] I started building a crawler per webpage, which is becoming too difficult to complete and maintain […] it is clear that these websites have their own structure. […] these sites rarely change their web structure […]
If websites have different structures, having separate spiders makes sense, and should make maintenance easier in the long term.
You say completing new spiders (I assume you mean developing them, not crawling or something else) is becoming difficult. However, if a new spider is similar to an existing one, you can simply copy and paste the most similar existing spider and make only the necessary changes.
Maintenance should be easiest with separate spiders for different websites. If a single website changes, you can fix the spider for that website. If you have a spider for multiple websites, and only one of them changes, you need to make sure that your changes for the modified website do not break the rest of the websites, which can be a nightmare.
Also, since you say website structures do not change often, maintenance should not be that hard in general.
If you notice you are repeating a lot of code, you might be able to extract some shared code into a spider middleware, a downloader middleware, an extension, an item loader, or even a base spider class shared by two or more spiders. But I would not try to use a single Spider subclass to scrape multiple different websites that are likely to evolve separately.
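The shared-base-class idea could be sketched as below. These are plain Python classes standing in for Scrapy spiders, and every name (BaseSiteSpider, SiteASpider, SiteBSpider, title_pattern) is hypothetical. The regular expressions are only placeholders for the site-specific selectors a real spider would use.

```python
import re


class BaseSiteSpider:
    """Shared logic; each website still gets its own small subclass."""

    name = None
    title_pattern = None  # site-specific pattern with one capturing group

    def parse(self, html):
        match = re.search(self.title_pattern, html)
        title = self.clean(match.group(1)) if match else None
        return {"site": self.name, "title": title}

    @staticmethod
    def clean(text):
        # Shared post-processing: collapse runs of whitespace.
        return " ".join(text.split())


class SiteASpider(BaseSiteSpider):
    name = "site_a"
    title_pattern = r"<h1>(.*?)</h1>"


class SiteBSpider(BaseSiteSpider):
    name = "site_b"
    title_pattern = r'<span class="doc-title">(.*?)</span>'


item_a = SiteASpider().parse("<h1>  Annual   report </h1>")
item_b = SiteBSpider().parse('<span class="doc-title">Minutes</span>')
print(item_a, item_b)
```

If site B changes its markup, only SiteBSpider needs editing, and the fix cannot break site A.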
I suggest you crawl specific tags such as body, h1, h2, h3, h4, h5, h6, p, and so on for each link. You can gather all the p tags and associate them with the link they came from. The same can be done for any other tag you want to crawl. You can also store the links found in those tags in your database.
For one of my web projects I need to scrape data from different web sources. To keep it simple, I will explain with an example.
Let's say I want to scrape the data about mobiles listed on their manufacturers' sites.
http://www.somebrand1.com/mobiles/
.
.
http://www.somebrand3.com/phones/
I have a huge list of URLs.
Every brand's page has its own way of presenting the HTML to the browser.
How can I write a normalized script to traverse the HTML of those listing page URLs and scrape the data irrespective of the format it is in?
Or do I need to write a script to scrape data for every pattern?
This is called broad crawling and, generally speaking, it is not an easy thing to implement because of the different nature, representation, and loading mechanisms that websites use.
The general idea would be to have a generic spider and some sort of site-specific configuration with a mapping between item fields and the XPath expressions or CSS selectors used to retrieve the field values from the page. In real life, things are not as simple as they seem: some fields require post-processing, other fields need to be extracted after sending a separate request, etc. In other words, it is very difficult to keep the spider generic and reliable at the same time.
The generic spider should receive a target site as a parameter, read the site-specific configuration and crawl the site according to it.
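A sketch of how the site-specific configuration and the post-processing step might fit together. Everything here is assumed for illustration: the SITE_CONFIGS layout, the XPath strings (which are stored but not evaluated in this sketch) and the helper names. The host name is taken from the example URLs in the question.

```python
from urllib.parse import urlparse


def to_float(text):
    """Keep digits and the decimal point, e.g. "$1,299.00" -> 1299.0."""
    cleaned = "".join(ch for ch in text if ch.isdigit() or ch == ".")
    return float(cleaned)


# Hypothetical configuration: one entry per target site.
SITE_CONFIGS = {
    "www.somebrand1.com": {
        "fields": {
            "name": "//h1/text()",
            "price": "//span[@class='price']/text()",
        },
        "post_process": {"price": to_float},
    },
}


def config_for(url):
    """Pick the configuration from the URL's host; raises KeyError if unknown."""
    return SITE_CONFIGS[urlparse(url).netloc]


def post_process(raw_item, config):
    """Apply the field-specific processor, or strip whitespace by default."""
    processors = config.get("post_process", {})
    return {
        field: processors.get(field, str.strip)(value)
        for field, value in raw_item.items()
    }


config = config_for("http://www.somebrand1.com/mobiles/")
# Stand-in for the raw strings the XPath expressions would have extracted.
item = post_process({"name": "  Phone X ", "price": "$1,299.00"}, config)
print(item)
```

Fields that need a separate request or JavaScript rendering do not fit this table-driven shape, which is where the generic approach tends to break down.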
Also see:
Broad Crawls