I am new to Python, and I was looking into using Scrapy to scrape specific elements on a page.
I need to fetch the name and phone number listed on a members page.
This script fetches the entire page; what can I add or change to fetch only those specific elements?
import scrapy

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["fali.org"]
    start_urls = [
        "http://www.fali.org/members/",
    ]

    def parse(self, response):
        filename = response.url.split("/?id=")[-2] + '%random%'
        with open(filename, 'wb') as f:
            f.write(response.body)
I cannot see a page at
http://www.fali.org/members/
instead it redirects to the home page, which makes it impossible to give specifics.
Here is an example:
article_title = response.xpath("//td[@id='HpWelcome']/h2/text()").extract()
That parses "Florida Association of Licensed Investigators (FALI)" from their homepage. You can get browser plugins to help you figure out XPaths; XPath Helper on Chrome makes it easy.
That said, go through the tutorials posted above, because you're going to have more questions, I'm sure, and broad questions like this aren't received well on Stack Overflow.
As shark3y states in his answer, the start_url gets redirected to the main page.
If you have read the documentation, you know that Scrapy starts scraping from the start_urls and does not know what you want to achieve beyond that.
In your case you need to start from http://www.fali.org/search/newsearch.asp, which returns the search results for all members. Now you can set up a Rule to go through the result list, call a parse_detail method for every member found, and follow the links through the result pagination.
In the parse_detail method you can go through each member's page and extract every piece of information you need. I guess you do not need the whole page, as in the example in your question, because that would pile up a lot of data on your computer, and in the end you would have to parse it anyway.
Related
I started looking at Scrapy and want to have one spider to get some prices of MTG cards.
First, I don't know if I'm 100% correct to use the link that selects all the available cards at the start of the spider:
name = 'bazarmtgbot'
allowed_domains = ['www.bazardebagda.com.br']
start_urls = ['https://bazardebagda.com.br/?view=ecom/itens&tcg=1&txt_estoque=1&txt_limit=160&txt_order=1&txt_extras=all&page=1']
1 - Should I use this kind of start_urls?
2 - Then, if you access the site, I could not find how to get the unit and price of the card; they are blank DIVs...
I got the name using:
titles = response.css(".itemNameP.ellipsis::text").extract()
3 - I couldn't find how I can do the pagination of this site to get the next set of items' units/prices. Do I need to copy the start_urls N times?
(and 3) It is fine to start on a given page. When scraping, you can queue additional URLs by looking for something like the "next page" button, extracting that link, and yielding a scrapy.Request that you want to follow up on. See this part of the Scrapy tutorial.
That site may be using a number of techniques to thwart price scraping: the blank price DIVs are loading an image like the one below and chopping parts of it up with gibberish CSS class names to form the number. You may need to do some OCR or find an alternative method. Bear in mind that, since they are going to that degree of effort, there may be other anti-scraping countermeasures as well.
I'm using Scrapy/Spyder to build my crawler, using BeautifulSoup as well. I have been working on a crawler and believe it works as expected on the few individual pages we have scraped, so my next challenge is to scrape the same site, but ONLY pages that are specific to a high-level category.
The only thing I have tried is using allowed_domains and start_urls, but when I did that it was literally hitting every page it found, and we want to control which pages we scrape so we have a clean list of information.
I understand that each page has links that take you outside of the page you are on and can end up elsewhere on the site, but what I'm trying to do is focus only on a few pages within each category:
# allowed_domains = ['dickssportinggoods.com']
# start_urls = ['https://www.dickssportinggoods.com/c/mens-top-trends-gear']
You can either base your spider on the Spider class and code the navigation yourself, or base it on the CrawlSpider class and use rules to control which pages get visited. From the information you provided, the latter approach seems more appropriate for your requirement. Check out the example to see how the rules work.
I want to scrape 2 different websites. One of them is plain HTML and the other uses JavaScript (for which I need Splash to scrape it).
So I have several questions about it:
Can I scrape two different types of websites (an HTML one and a JavaScript one) with only one bot? I did two HTML websites before and it worked, but I wonder if this also works when one of them uses JavaScript.
If the first question is possible, can I export the JSON separately? Like output1.json for url1 and output2.json for url2?
As you can see from my code, it needs to be edited, and I don't know how to do that when two different types of websites need to be scraped.
Is there any Scrapy tool to compare JSON? (The two websites have almost the same content. I want to make output1.json the base and check whether some values in output2.json differ.)
My code:
class MySpider(scrapy.Spider):
    name = 'mybot'
    allowed_domains = ['url1', 'url2']

    def start_requests(self):
        urls = (
            (self.parse1, 'url1'),
            (self.parse2, 'url2'),
        )
        for callbackfunc, url in urls:
            yield scrapy.Request(url, callback=callbackfunc)
            # In fact url2 is the JavaScript website, so I clearly need Splash here

    def parse1(self, response):
        pass

    def parse2(self, response):
        pass
Yes, you can scrape more than one site with the same spider, but it doesn't make sense if they are too different. The way to do that you have already figured out: allowed_domains and start_requests (or start_urls). However, exporting to different files won't be straightforward; you will have to write your own export code.
IMHO having one spider per site is the way to go. If they share some code, you can have a BaseSpider class from where your spiders can inherit.
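One hedged sketch of such export code is an item pipeline that routes items to output1.json or output2.json based on a "site" field the spider sets on each item (the field name and the JSON-lines format are made-up conventions, not Scrapy defaults):

```python
import json

class PerSiteJsonLinesPipeline:
    # Routes items to one JSON-lines file per site, keyed on a
    # hypothetical item["site"] field set by the spider's callbacks.
    def open_spider(self, spider):
        self.files = {
            "url1": open("output1.json", "w"),
            "url2": open("output2.json", "w"),
        }

    def close_spider(self, spider):
        for f in self.files.values():
            f.close()

    def process_item(self, item, spider):
        site = item.get("site", "url1")
        self.files[site].write(json.dumps(item) + "\n")
        return item
```

Enable it via ITEM_PIPELINES in settings.py. Each line of the output files is then one JSON object, which also makes the two outputs easy to diff with your own comparison script.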
And regarding the javascript site you mentioned, are you sure you can not request its API directly?
I've been working through a few Scrapy tutorials and I have a question (I'm very new to this, so I apologize if this is a dumb question). Most of what I've seen so far involves:
1) feeding a starting url to scrapy
2) telling scrapy what parts of the page to grab
3) telling scrapy how to find the "next" page to scrape from
What I'm wondering is - am I able to scrape data using scrapy when the data itself isn't on the start page? For example, I have a link that goes to a forum. The forum contains links to several subforums. Each subforum has links to several threads. Each thread contains several messages (possibly over multiple pages). The messages are what I ultimately want to scrape. Is it possible to do this and use only the initial link to the forum? Is it possible for scrapy to navigate through every subforum, and every thread and then start scraping?
Yes, you can navigate without scraping data, though you will need to extract the links for navigation with XPath, CSS, or CrawlSpider rules. These links can be used just for navigation and don't need to be loaded into items.
There's no requirement that you load something into an item from every page you visit. Consider a scenario where you need to authenticate past a login page to get to the data you want to scrape. There is no need to scrape/pipeline/write any data from the login page.
For your purposes:
def start_requests(self):
    forum_url = <spam>
    yield scrapy.Request(url=forum_url, callback=self.parse_forum)

def parse_forum(self, response):
    # get the subforum urls
    for u in subforum_urls:
        yield scrapy.Request(url=u, callback=self.parse_subforum)

def parse_subforum(self, response):
    # get the thread urls
    for u in thread_urls:
        yield scrapy.Request(url=u, callback=self.parse_thread)

def parse_thread(self, response):
    # get the data you want
    yield <the data>
I have been using scrapy for a personal project. My problem is very similar to the question asked on the following page:
Scrapy: Follow link to get additional Item data?
The page I am scraping is the following:
http://www.tennisinsight.com/player_activity.php?player_id=51
This page has a list of matches in this form for eg:
Round of 16 Def. Ivan Dodig(37,2.41) (CRO) 6-3 6-3 Recap Match Stats $1.043
I have currently written Scrapy code that opens every link on the page that has a "Match Stats" link and scrapes the data on that page into an individual record.
In addition to this, I want to scrape the "Odds" column (which is the $1.043 above) and add this data to the record.
I have searched for an answer, and it seems that I have to use the Request meta field and pass this data along to the parse method. However, I am struggling to incorporate it into my code. The answer from the Stack Overflow link above is: "To scrape additional fields which are on other pages, in a parse method extract URL of the page with additional info, create and return from that parse method a Request object with that URL and pass already extracted data via its meta parameter."
This makes perfect sense; however, the URLs that I scrape are in the rules, so I don't know how to extract the required data.
Here is part of my code so far which will hopefully better explain my problem.
rules = (
    Rule(SgmlLinkExtractor(allow=r"match_stats_popup.php\?matchID=\d+",
                           restrict_xpaths='//td[@class="matchStyle"]',
                           tags='a', attrs='href', process_value=getPopupLink),
         callback='parse_match', follow=True),
)
The parse_match function parses the match stats into one item.
So what happens is that each of these match stats links are opened up, and there is no way for me to access the main page's Odds column.
Any help will be much appreciated.