Web scraping with BeautifulSoup on Amazon - Python

I can only get the first 30 products from the best-selling products page on Amazon.
Is there a restriction beyond that?
Although all 50 products share the same class, find_all only returns 30 of them.
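For reference, a minimal sketch of the pattern described above; the URL and the class name are placeholders, and only products present in the initial HTML (not items injected later by JavaScript) will show up in the result:

import requests
from bs4 import BeautifulSoup

# Placeholder best-sellers URL and product class; not taken from the question.
URL = "https://www.amazon.com/gp/bestsellers/some-category"
HEADERS = {"User-Agent": "Mozilla/5.0"}  # Amazon tends to reject the default requests user agent

html = requests.get(URL, headers=HEADERS).text
soup = BeautifulSoup(html, "html.parser")

# find_all only sees products that are in the downloaded HTML; items added
# by JavaScript after the page loads in a browser will not be found here.
products = soup.find_all("div", class_="product-card")
print(len(products))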

Related

How to scrape data from multiple unrelated sections of a website (using Scrapy)

I have made a Scrapy web crawler which can scrape Amazon. It searches for items using a list of keywords and scrapes the data from the resulting pages.
However, I would like to scrape Amazon for a large portion of its product data. I don't have a preferred list of keywords with which to query for items. Rather, I'd like to crawl the website evenly and collect X items that are representative of all the products listed on Amazon.
Does anyone know how to scrape a website in this fashion? Thanks.
I'm putting my comment as an answer so that others looking for a similar solution can find it more easily.
One way to achieve this is to go through each category (furniture, clothes, technology, automotive, etc.) and collect a set number of items from each. Amazon has side/top bars with navigation links to the different categories, so you can let the crawler run through those.
The process would be as follows (see the sketch after this answer):
Follow category URLs from the initial Amazon.com parse
Use a different parse function for the callback, one that will scrape however many items you want from that category
Ensure that the data is written to a file (it will probably be a lot of data)
However, such an approach would not reflect the proportion each category contributes to Amazon's total catalogue. Try looking for an "X number of results" label in each category to compensate for that. Good luck with your project!
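A minimal Scrapy sketch of the category-crawl approach outlined above; the CSS selectors and the per-category limit are assumptions, not tested against Amazon's current markup:

import scrapy

class AmazonCategorySpider(scrapy.Spider):
    name = "amazon_categories"
    start_urls = ["https://www.amazon.com/"]
    items_per_category = 100  # assumed cap per category

    def parse(self, response):
        # Follow category links from the landing page; the selector is a placeholder.
        for href in response.css("a.category-link::attr(href)").getall():
            yield response.follow(href, callback=self.parse_category)

    def parse_category(self, response):
        # Different callback that scrapes a set number of items from one category page.
        for product in response.css("div.product")[: self.items_per_category]:
            yield {
                "title": product.css("h2::text").get(),
                "price": product.css(".price::text").get(),
                "category": response.url,
            }

Running it with a feed export (e.g. scrapy crawl amazon_categories -o products.csv) takes care of writing the collected data to a file.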

How to scrape tens of thousands of URLs every night using Scrapy

I am using Scrapy to scrape some big brands and import the sale data into my site. Currently I am using:
DOWNLOAD_DELAY = 1.5
CONCURRENT_REQUESTS_PER_DOMAIN = 16
CONCURRENT_REQUESTS_PER_IP = 16
I am using Item Loaders to specify the CSS/XPath rules and a Pipeline to write the data to CSV. The data I collect is the original price, sale price, colour, sizes, name, image URL and brand.
I have written the spider for only one merchant, which has around 10k URLs, and it takes about 4 hours.
My question is: does 4 hours sound alright for 10k URLs, or should it be faster than that? If the latter, what else do I need to do to speed it up?
I am using only one Splash instance locally to test, but in production I am planning to use 3 Splash instances.
Now the main issue: I have about 125 merchants with about 10k products each on average, and a couple of them have more than 150k URLs to scrape.
I need to scrape all their data every night to update my site.
Since my single spider takes 4 hours to scrape 10k URLs, I am wondering whether 125 x 10k URLs every night is even a realistic goal.
I would really appreciate your experienced input on my problem.
Your DOWNLOAD_DELAY is enforced per IP, so if there is only one IP, 10,000 requests will take 15,000 seconds (10,000 * 1.5). That is just over 4 hours, so yes, that's about right.
If you are scraping more than one site, they will be different IP addresses, so they should run more or less in parallel, and the whole job should still take around 4 hours.
If you are scraping 125 sites, you will probably hit a different bottleneck at some point.
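The back-of-the-envelope arithmetic, using the settings quoted in the question; the estimate assumes a single download slot (one IP) per merchant:

# settings.py values from the question
DOWNLOAD_DELAY = 1.5                 # seconds between requests to the same slot
CONCURRENT_REQUESTS_PER_DOMAIN = 16
CONCURRENT_REQUESTS_PER_IP = 16      # non-zero, so the delay is applied per IP

# With a 1.5 s delay per request to a single IP, one merchant with 10k URLs takes:
urls_per_merchant = 10_000
seconds = urls_per_merchant * DOWNLOAD_DELAY   # 15,000 s, just over 4 hours

# 125 merchants crawled in parallel (different IPs, hence separate delay slots)
# should finish in roughly the same wall-clock time, provided the machine,
# the Splash instances and the pipeline can keep up.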

Scraping product data with categories from e-commerce sites

I need to develop an application that takes the URL of an e-commerce website as input and scrapes the product titles and prices along with their categories and sub-categories.
Scrapy seems like a good solution for scraping the data, so my question is: how can I tell Scrapy where the titles, prices, categories and sub-categories are so it can extract them, given that websites have different structures and don't use the same tags?
EDIT: Let me change my question to this: can't we write a generic spider that takes the start URL, allowed domains, and XPath or CSS selectors as arguments?
Categories and sub-categories are usually in the breadcrumbs.
In general the CSS selector for those will be .breadcrumb a, and that will probably work for 80% of modern e-commerce websites.
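A minimal sketch of such a generic spider, passing the start URL, allowed domain and selectors in as spider arguments and defaulting the category selector to the breadcrumb pattern mentioned above; the argument names and default selectors are assumptions to adapt per site:

import scrapy

class GenericProductSpider(scrapy.Spider):
    name = "generic_products"

    def __init__(self, start_url=None, allowed=None,
                 title_css="h1::text", price_css=".price::text",
                 crumbs_css=".breadcrumb a::text", **kwargs):
        super().__init__(**kwargs)
        self.start_urls = [start_url]
        self.allowed_domains = [allowed] if allowed else []
        self.title_css, self.price_css, self.crumbs_css = title_css, price_css, crumbs_css

    def parse(self, response):
        # Assumes breadcrumbs look like Home > Category > Subcategory > Product.
        crumbs = response.css(self.crumbs_css).getall()
        yield {
            "title": response.css(self.title_css).get(),
            "price": response.css(self.price_css).get(),
            "category": crumbs[1] if len(crumbs) > 1 else None,
            "subcategory": crumbs[2] if len(crumbs) > 2 else None,
        }

It would then be invoked with Scrapy's -a spider arguments, e.g. scrapy crawl generic_products -a start_url=https://example.com/product/1 -a allowed=example.com -a price_css=".sale-price::text".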

Get eBay product category from its URL with Python

I need to infer customers' intention to buy by analyzing a large set of eBay URLs. Since these URLs are not semantically significant (i.e. I cannot parse the URL itself to find useful information about the product), I was thinking of retrieving the 'product category' for each eBay URL. By product category I mean, for example,
Home & Garden>Bedding>Sheets & Pillowcases or
Musical Instruments & Gear>Guitars & Basses
I'm trying the ebaysdk (https://github.com/timotheus/ebaysdk-python) for Python, but I haven't found anything regarding the possibility of getting a product's category from its URL.
Is this feasible with a different eBay API?
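One possible approach, sketched under assumptions: eBay item URLs usually contain a numeric item ID, which can be fed to the Shopping API's GetSingleItem call through ebaysdk-python. The regex and the response field names here are assumptions to verify against the current API documentation:

import re
from ebaysdk.shopping import Connection as Shopping

def category_from_url(url, appid):
    # eBay item URLs typically carry the numeric item ID after /itm/ (assumption).
    match = re.search(r"/itm/(?:[^/]+/)?(\d+)", url)
    if not match:
        return None
    item_id = match.group(1)

    api = Shopping(appid=appid, config_file=None)
    response = api.execute("GetSingleItem", {"ItemID": item_id})
    item = response.dict().get("Item", {})
    # Expected to hold something like "Musical Instruments & Gear:Guitars & Basses"
    # (field name assumed from the Shopping API's GetSingleItem response).
    return item.get("PrimaryCategoryName")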

How to handle dynamic URLs while crawling online stores?

I am crawling online stores for price comparison. Most of the stores use dynamic URLs heavily, which causes my crawler to spend a lot of time on each store. Even though most of them have only 5-6k unique products, they have 300k or more unique URLs. Any idea how to get around this?
Thanks in advance!
If you are parsing product pages, these URLs usually contain some kind of product ID.
Find the pattern that extracts the product ID from the URL, and use it to filter out already-visited URLs.
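A small sketch of that idea inside a Scrapy spider; the regex assumes the product ID appears after a token like "product", "pid" or "item" in the URL, which is an assumption you would adapt for each store:

import re
import scrapy

PRODUCT_ID_RE = re.compile(r"(?:product|pid|item)[=/_-](\d+)", re.IGNORECASE)

class StoreSpider(scrapy.Spider):
    name = "store"
    start_urls = ["https://shop.example.com/"]  # placeholder

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.seen_products = set()

    def parse(self, response):
        for href in response.css("a::attr(href)").getall():
            match = PRODUCT_ID_RE.search(href)
            if not match:
                continue
            product_id = match.group(1)
            if product_id in self.seen_products:
                continue  # same product reachable through a different dynamic URL
            self.seen_products.add(product_id)
            yield response.follow(href, callback=self.parse_product)

    def parse_product(self, response):
        yield {"url": response.url, "title": response.css("h1::text").get()}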
