I started looking at Scrapy and want to build one spider that gets the prices of MTG cards.
First, I'm not sure it is correct to use a link that selects all the available cards at the start of the spider:
name = 'bazarmtgbot'
allowed_domains = ['www.bazardebagda.com.br']
start_urls = ['https://bazardebagda.com.br/?view=ecom/itens&tcg=1&txt_estoque=1&txt_limit=160&txt_order=1&txt_extras=all&page=1']
1 - Should I use this kind of start_urls?
2 - Then, if you access the site, I could not find how to get the unit count and price of a card; they are blank DIVs...
I got the name using:
titles = response.css(".itemNameP.ellipsis::text").extract()
3 - I couldn't find how to do the pagination on this site to get the next set of items' units/prices. Do I need to copy the start_urls N times?
(1 and 3) It is fine to start on a given page. While scraping you can queue additional URLs to scrape by looking for something like the "next page" button, extracting its link, and yielding a scrapy.Request that you want to follow up on. See the follow-links part of the Scrapy tutorial.
That site may be using several techniques to thwart price scraping: the blank price DIVs load an image and chop parts of it up with gibberish CSS class names to form the number. You may need to do some OCR or find an alternative method. Bear in mind that, since they go to that degree, there may be other anti-scraping countermeasures as well.
So I've been trying to extract every single phone number from a website that deals in properties (renting/buying houses, apartments, etc.).
There's plenty of categories (cities, type of properties) and ads in each of those. Whenever you enter an ad, there's obviously more pictures, descriptions, and a phone number at the bottom.
This is the site in question.
https://www.nekretnine.rs/
I wrote a Python script that's supposed to extract those phone numbers, but it's giving me nothing. This is the script.
I figure it's not working because it's looking for that information on the home page, and the info is not there, but I just can't figure out how to include all those ads across all those categories in my loop. Don't even ask about an API; they have none. I mean, I crashed their website with the original, sleep-less script.
import time
from urllib.request import urlopen
from bs4 import BeautifulSoup

data = []
for i in range(1, 50):
    url = "https://www.nekretnine.rs/" + str(i)
    page = urlopen(url)
    soup = BeautifulSoup(page, "html.parser")
    x = soup.find_all("div", {"class": "label-small"})
    time.sleep(2)
    for item in x:
        number = item.find_all("form", attrs={"span": "cell-number"})[0].text
        data.append(number)
print(data)
If the content you need is not on the home page, you should use BeautifulSoup to find the links to the other pages you need, then request that HTML and look for the information there.
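Here is a small offline sketch of that two-step idea. The `div.offer` class name and the sample HTML are made up for illustration; you would substitute the selectors that match the real listing pages:

```python
from bs4 import BeautifulSoup

# Step 1: parse a listing page and collect the links to the individual ads.
# This sample HTML stands in for the real page; the class names are assumptions.
listing_html = """
<div class="offer"><a href="/oglasi/stan-1">Stan 1</a></div>
<div class="offer"><a href="/oglasi/kuca-2">Kuca 2</a></div>
"""
soup = BeautifulSoup(listing_html, "html.parser")
ad_links = ["https://www.nekretnine.rs" + a["href"]
            for a in soup.select("div.offer a[href]")]
print(ad_links)

# Step 2 (sketch): fetch each ad page and look for the phone number there,
# with a polite time.sleep() between requests:
#   ad_soup = BeautifulSoup(urlopen(link).read(), "html.parser")
#   number = ad_soup.select_one("span.cell-number")
```

The key point is that the phone numbers live on the ad detail pages, so the loop must run over the collected `ad_links`, not over the home page.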
For anyone stumbling here, I found the answer:
https://webscraper.io/
This browser extension has everything I needed; it's simple, no coding required, minus some regex if you need it.
I'm building a system, mostly in Python with Scrapy, in which I can, basically, find information about a specific product. The thing is that the request URL is massive. I got a clue that I should replace some parts of it with variables to reach the specific product I'd like to search for, but the URL has so many fields that I don't know, for sure, how to do it.
e.g: "https://www.amazon.com.br/s?k=demi+lovato+365+dias+do+ano&adgrpid=86887777368&hvadid=392971063429&hvdev=c&hvlocphy=9047761&hvnetw=g&hvpos=1t1&hvqmt=e&hvrand=11390662277799676774&hvtargid=kwd-597187395757&hydadcr=5658_10696978&tag=hydrbrgk-20&ref=pd_sl_21pelgocuh_e%2Frobot.txt"
"demi+lovato+365+dias+do+ano" is the book title, but I can see a lot of information in the URL that I simply can't supply and that, of course, changes from title to title. One solution I thought possible was to POST the title I'm looking for into the search bar and find it on the results page, but I don't know if that's the best approach since, in fact, this is the first time I'll be working with web scraping.
Does anyone have a tip for how I can do that? All I could find was how to scrape all products for price comparison, scrape specific information about all those products, and things like that, but nothing about searching for specific products.
Thanks for any contributions; this is very important to me, and sorry about anything: I'm not a very active user and not a native English speaker.
Feel free to give me any advice about user behavior; being better is always something I aim for.
You should use the Rule class available in the Scrapy framework. It helps you define how to navigate the site and its sub-pages. Additionally, you can configure tags other than anchors, such as span or div, to look for the URL of a link. This way, additional query params in the link will be populated by the Scrapy session as it emulates clicks on the hyperlinks. If you skip the additional query params in the URL, there is a high chance you will be blocked.
How does scrapy use rules?
You don't need to follow that long link at all; often the different parameters are associated with your current session or settings/filters, and you can keep only what you need.
Here is what I meant:
You can generate same result using these 2 urls:
https://www.amazon.com.br/s?k=demi+lovato+365+dias+do+ano
https://www.amazon.com.br/s?k=demi+lovato+365+dias+do+ano&adgrpid=86887777368&hvadid=392971063429&hvdev=c&hvlocphy=9047761&hvnetw=g&hvpos=1t1&hvqmt=e&hvrand=11390662277799676774&hvtargid=kwd-597187395757&hydadcr=5658_10696978&tag=hydrbrgk-20&ref=pd_sl_21pelgocuh_e%2Frobot.txt
If both links generate the same results, then that's it; otherwise you will definitely have to play with the different parameters. You can't predict website behavior without actually testing it, and if having a lot of parameters is an issue, try something like:
from urllib.parse import quote_plus

base_url = "https://www.amazon.com.br"
link = base_url + "/s?k=%s&adgrpid=%s&hvadid=%s" % (
    quote_plus(title), '86887777368', '392971063429')
I'm using Scrapy/Spyder to build my crawler, using BeautifulSoup as well. I have been working on a crawler and believe it works as expected on the few individual pages we have scraped, so my next challenge is to scrape the same site, but ONLY pages that are specific to a high-level category.
The only thing I have tried is using allowed_domains and start_urls, but when I did that, it hit literally every page it found, and we want to control which pages we scrape so we have a clean list of information.
I understand that each page contains links that take you outside the page you are on and can end up elsewhere on the site, but what I'm trying to do is focus only on a few pages within each category.
# allowed_domain = ['dickssportinggoods.com']
# start_urls = ['https://www.dickssportinggoods.com/c/mens-top-trends-gear']
You can either base your spider on the Spider class and code the navigation yourself, or base it on the CrawlSpider class and use rules to control which pages get visited. From the information you provided, it seems the latter approach is more appropriate for your requirement. Check out the example to see how the rules work.
Is it possible to scrape links by the date associated with them? I'm trying to implement a daily-run spider that saves article information to a database, but I don't want to re-scrape articles that I have already scraped before, i.e. yesterday's articles. I ran across this SO post asking the same thing, and the scrapy-deltafetch plugin was suggested.
However, this relies on checking new requests against previously saved request fingerprints stored in a database. I'm assuming that if the daily scraping went on for a while, the database would need significant storage overhead to keep the fingerprints of everything already scraped.
So given a list of articles on a site like cnn.com, I want to scrape all the articles that have been published today 6/14/17, but once the scraper hits later articles with a date listed as 6/13/17, I want to close the spider and stop scraping. Is this kind of approach possible with scrapy? Given a page of articles, will a CrawlSpider start at the top of the page and scrape articles in order?
Just new to Scrapy, so not sure what to try. Any help would be greatly appreciated, thank you!
You can use a custom deltafetch_key which checks the date and the title as the fingerprint.
from w3lib.url import url_query_parameter
from scrapy import Request
...
    def parse(self, response):
        ...
        for product_url in response.css('a.product_listing::attr(href)').extract():
            yield Request(
                product_url,
                meta={'deltafetch_key': url_query_parameter(product_url, 'id')},
                callback=self.parse_product_page
            )
        ...
I compose a date using datetime.strptime(item['dateinfo'], "%b-%d-%Y") from information cobbled together on the item of interest.
After that I just check it against a configured age in my settings, which can be overridden per invocation. You can raise a CloseSpider exception when you find an age that is too old, or you can set a finished flag and act on that in any of your other code.
No need to remember anything. I use this on a spider that I run daily, and I simply set a 24-hour age limit.
I am new to Python, and I was looking into using Scrapy to scrape specific elements on a page.
I need to fetch the name and phone number listed on a members page.
This script fetches the entire page; what can I add/change to fetch only those specific elements?
import scrapy

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["fali.org"]
    start_urls = [
        "http://www.fali.org/members/",
    ]

    def parse(self, response):
        filename = response.url.split("/?id=")[-2] + '%random%'
        with open(filename, 'wb') as f:
            f.write(response.body)
I cannot see a page:
http://www.fali.org/members/
instead it redirects to the home page.
That makes it impossible to give specifics.
Here is an example:
article_title = response.xpath("//td[@id='HpWelcome']/h2/text()").extract()
That parses "Florida Association of Licensed Investigators (FALI)" from their homepage. You can get browser plugins to help you figure out XPaths; XPath Helper on Chrome makes it easy.
That said, go through the tutorials posted above, because you are going to have more questions, I'm sure, and broad questions like this aren't received well on Stack Overflow.
As shark3y states in his answer, the start_url gets redirected to the main page.
If you have read the documentation, you should know that Scrapy starts scraping from the start_urls and does not know what you want to achieve.
In your case you need to start from http://www.fali.org/search/newsearch.asp, which returns the search results for all members. Now you can set up a Rule to go through the result list, call a parse_detail method for every member found, and follow the links through the result pagination.
In the parse_detail method you can go through the member's page and extract every piece of information you need. I guess you do not need the whole page, as in the example in your question, because that would generate a lot of data on your computer, and in the end you would have to parse it anyway.