How to use one bot for different websites - Python

I want to scrape 2 different websites. One of them is plain HTML and the other one is rendered with JavaScript (so I need Splash to scrape it).
So I have several questions about it:
Can I scrape two different types of websites (one HTML, one JavaScript) with only one bot? I did two HTML websites before and it worked, but I wonder if this also works when one of them is JavaScript.
If the first question is possible, can I export JSON separately? Like output1.json for url1 and output2.json for url2?
As you can see from my code, the code needs to be edited, and I don't know how to do that when two different types of websites need to be scraped.
Is there any Scrapy tool to compare JSON? (The two websites have almost the same content. I want to make output1.json the base and check whether some values in output2.json differ or not.)
My code:
import scrapy

class MySpider(scrapy.Spider):
    name = 'mybot'
    allowed_domains = ['url1', 'url2']

    def start_requests(self):
        urls = (
            (self.parse1, 'url1'),
            (self.parse2, 'url2'),
        )
        for callbackfunc, url in urls:
            yield scrapy.Request(url, callback=callbackfunc)
            # In fact url2 is the JavaScript website, so I clearly need Splash here

    def parse1(self, response):
        pass

    def parse2(self, response):
        pass

Yes, you can scrape more than one site with the same spider, but it doesn't make sense if they are too different. The way to do that you have already figured out: allowed_domains and start_requests (or start_urls). However, exporting to different files won't be straightforward; you will have to write your own export code.
IMHO, having one spider per site is the way to go. If they share some code, you can have a BaseSpider class from which your spiders can inherit.
And regarding the JavaScript site you mentioned, are you sure you cannot request its API directly?
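If you do stay with a single spider, here is a rough sketch of how the two request types and the separate JSON outputs could fit together. It assumes the scrapy-splash plugin is installed and configured; the url1/url2 placeholders, the 'source' field, and the pipeline name are invented for illustration:

import json

import scrapy
from scrapy_splash import SplashRequest  # requires the scrapy-splash plugin

class MySpider(scrapy.Spider):
    name = 'mybot'

    def start_requests(self):
        # Plain-HTML site: a normal request is enough.
        yield scrapy.Request('http://url1', callback=self.parse1)
        # JavaScript site: render it through Splash first.
        yield SplashRequest('http://url2', callback=self.parse2, args={'wait': 2})

    def parse1(self, response):
        yield {'source': 'url1', 'title': response.css('title::text').get()}

    def parse2(self, response):
        yield {'source': 'url2', 'title': response.css('title::text').get()}

class SplitJsonPipeline:
    """Write output1.json / output2.json depending on the item's source."""

    def open_spider(self, spider):
        self.items = {'url1': [], 'url2': []}

    def process_item(self, item, spider):
        self.items[item['source']].append(dict(item))
        return item

    def close_spider(self, spider):
        for i, source in enumerate(('url1', 'url2'), start=1):
            with open(f'output{i}.json', 'w') as f:
                json.dump(self.items[source], f, indent=2)

The pipeline has to be enabled in ITEM_PIPELINES for it to run. As for comparing the two files afterwards, that is plain Python rather than a Scrapy feature: load both with json.load() and diff the resulting dictionaries.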

Related

What is the best way to add multiple start URLs in a Scrapy CrawlSpider with different allowed domains?

Currently, I'm using the code below to add multiple start URLs (50K):
import pandas as pd
import tldextract

start_urls = []
allowed_domains = []
df = pd.read_excel("xyz.xlsx")
for url in df['URL']:
    start_urls.append(url)
    allowed_domains.append(tldextract.extract(url).registered_domain)
But I think this will not keep the allowed domains separate for each website, and I also want to pass some meta information with each link.
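One way to keep per-link information together is to build the requests in start_requests instead of start_urls, so each URL can carry its own meta. A minimal sketch, assuming the sheet has 'URL' and 'category' columns (the 'category' column is an invention for illustration):

import pandas as pd
import scrapy
import tldextract

class ManyDomainsSpider(scrapy.Spider):
    name = 'many_domains'

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Read the sheet once and build allowed_domains before the crawl starts,
        # so the offsite middleware can still filter stray requests.
        self.df = pd.read_excel("xyz.xlsx")
        self.allowed_domains = [
            tldextract.extract(u).registered_domain for u in self.df['URL']
        ]

    def start_requests(self):
        for _, row in self.df.iterrows():
            yield scrapy.Request(
                row['URL'],
                callback=self.parse,
                meta={'category': row.get('category')},  # per-link meta
            )

    def parse(self, response):
        yield {'url': response.url, 'category': response.meta['category']}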

How to traverse only certain areas of a site? Basically, stay within certain pages?

I'm using scrapy/spyder to build my crawler, using BeautifulSoup as well. I have been working on a crawler and believe we are at a point where it works as expected on the few individual pages we have scraped, so my next challenge is to scrape the same site, but ONLY pages that are specific to a high-level category.
The only thing I have tried is using allowed_domains and start_urls, but when I did that, it was literally hitting every page it found, and we want to control which pages we scrape so we have a clean list of information.
I understand that each page has links that take you outside the page you are on and can end up elsewhere on the site, but what I'm trying to do is focus on only a few pages within each category:
# allowed_domain = ['dickssportinggoods.com']
# start_urls = ['https://www.dickssportinggoods.com/c/mens-top-trends-gear']
You can either base your spider on the Spider class and code the navigation yourself, or base it on the CrawlSpider class and use rules to control which pages get visited. From the information you provided, it seems that the latter approach is more appropriate for your requirement. Check out the CrawlSpider example in the docs to see how the rules work.
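As a rough illustration of how rules keep a crawl inside one category (the allow pattern below is only a guess at the site's URL structure):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class TrendsSpider(CrawlSpider):
    name = 'trends'
    allowed_domains = ['dickssportinggoods.com']
    start_urls = ['https://www.dickssportinggoods.com/c/mens-top-trends-gear']

    rules = (
        # Follow only links whose path stays under the category,
        # and hand every matching page to parse_item.
        Rule(LinkExtractor(allow=r'/c/mens-'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        yield {'url': response.url, 'title': response.css('title::text').get()}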

Is it possible for scrapy to navigate links before actually scraping data?

I've been working through a few scrapy tutorials and I have a question (I'm very new to this, so I apologize if this is a dumb question). Most of what I've seen so far involved:
1) feeding a starting url to scrapy
2) telling scrapy what parts of the page to grab
3) telling scrapy how to find the "next" page to scrape from
What I'm wondering is - am I able to scrape data using scrapy when the data itself isn't on the start page? For example, I have a link that goes to a forum. The forum contains links to several subforums. Each subforum has links to several threads. Each thread contains several messages (possibly over multiple pages). The messages are what I ultimately want to scrape. Is it possible to do this and use only the initial link to the forum? Is it possible for scrapy to navigate through every subforum, and every thread and then start scraping?
Yes, you can navigate without scraping data, though you will need to extract the links for navigation with XPath, CSS selectors, or CrawlSpider rules. These links can be used just for navigation and don't need to be loaded into items.
There's no requirement that you load something into an item from every page you visit. Consider a scenario where you need to authenticate past login to get to data that you want to scrape. No need to scrape/pipeline/write any data from the login page.
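That login case, sketched roughly (the form field names and selectors are invented), is a first request whose only job is to authenticate, yielding no items of its own:

import scrapy

class LoginFirstSpider(scrapy.Spider):
    name = 'login_first'
    start_urls = ['https://example.com/login']

    def parse(self, response):
        # Nothing is scraped from the login page; it is only navigated.
        yield scrapy.FormRequest.from_response(
            response,
            formdata={'username': 'me', 'password': 'secret'},  # invented fields
            callback=self.after_login,
        )

    def after_login(self, response):
        # Only now do we start yielding items.
        yield {'greeting': response.css('.welcome::text').get()}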
For your purposes:
def start_requests(self):
    forum_url = <spam>
    yield scrapy.Request(url=forum_url, callback=self.parse_forum)

def parse_forum(self, response):
    # Get the subforum URLs, e.g. with response.css() or response.xpath()
    # (the selectors are site-specific).
    for u in subforum_urls:
        yield scrapy.Request(url=u, callback=self.parse_subforum)

def parse_subforum(self, response):
    # Get the thread URLs the same way.
    for u in thread_urls:
        yield scrapy.Request(url=u, callback=self.parse_thread)

def parse_thread(self, response):
    # Get the data you want.
    yield <the data>

How to code a Scrapy spider to grab differently formatted HTML tables

I've used Scrapy before, but only to scrape information from one site. I want to use Scrapy to grab information from directories on different sites. On each of these sites the information is stored in a simple HTML table with the same titles. How do I configure Scrapy to grab data from each HTML table even if the table classes differ from site to site? On a larger scale, what I'm asking is how to use Scrapy when I want to hit different websites that may be formatted differently. I'll include below pictures of the HTML source and XPaths of several of the sites.
[Images in the original post: the table fields (more or less the same for each site's directory), the name-column XPath for site 1 and for site 2, the general HTML formatting of site 1 (phone number blurred) and site 2, and the general formatting of a third site, which differs from the first two but is still a table with 4 columns.]
Yes, it's a bit of a pain to have to write a spider for every site, especially if there are hundreds of them and the Items are the same for all of them.
If it fits your needs, you might like to store the XPaths for each site in a file, e.g. a CSV file. Then you can fetch the URLs and expressions from the CSV and use them in your spider (adapted from here):
import csv
from scrapy import Request

def start_requests(self):
    with open(getattr(self, "file", "todo.csv"), newline="") as f:
        reader = csv.DictReader(f)
        for line in reader:
            request = Request(line.pop('url'))
            request.meta['fields'] = line
            yield request

def parse(self, response):
    xpath = response.meta['fields']['tablexpath']
    # ... use the xpath to extract your table, e.g. rows = response.xpath(xpath)
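The CSV this expects would have one row per site, with at least a url column and a tablexpath column; something along these lines (the values are made up):

url,tablexpath
https://site1.example/directory,//table[@class='members']//tr
https://site2.example/people,//table[@id='staff']//tr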
If you need to release your spider to e.g. scrapyd or scrapinghub, you will need to package your .csv file along with your code. To do so, you will have to edit the setup.py that shub deploy or scrapyd-client generates and add:
setup(
    ...
    package_data={'myproject': ['my_csv.csv']},
)
Also in your spider, instead of opening your file directly with open, you should use this:
from pkg_resources import resource_stream
f = resource_stream('myproject', 'my_csv.csv')
Here's an example. If you don't deploy your spider, just ignore the above. If you do, this will save you a few hours of debugging.
I did that by creating a scrapy project with one spider per site and using the same item class for all the different spiders.
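A minimal sketch of that layout, with one shared item class and two spiders reusing it (the class names, URLs, and selectors are invented):

import scrapy

class DirectoryEntry(scrapy.Item):
    # Same fields for every site, so every spider yields comparable data.
    name = scrapy.Field()
    phone = scrapy.Field()

class SiteOneSpider(scrapy.Spider):
    name = 'site1'
    start_urls = ['https://site1.example/directory']

    def parse(self, response):
        for row in response.xpath("//table//tr"):
            yield DirectoryEntry(
                name=row.xpath("./td[1]/text()").get(),
                phone=row.xpath("./td[2]/text()").get(),
            )

class SiteTwoSpider(scrapy.Spider):
    name = 'site2'
    start_urls = ['https://site2.example/people']

    def parse(self, response):
        # Different selectors, same item class.
        for row in response.xpath("//table[@id='staff']//tr"):
            yield DirectoryEntry(
                name=row.xpath("./td[1]/a/text()").get(),
                phone=row.xpath("./td[3]/text()").get(),
            )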

Scraping specific elements from page

I am new to Python, and I was looking into using Scrapy to scrape specific elements on a page.
I need to fetch the name and phone number listed on a members page.
This script fetches the entire page; what can I add/change to fetch only those specific elements?
import scrapy

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["fali.org"]
    start_urls = [
        "http://www.fali.org/members/",
    ]

    def parse(self, response):
        filename = response.url.split("/?id=")[-2] + '%random%'
        with open(filename, 'wb') as f:
            f.write(response.body)
I cannot see a page at
http://www.fali.org/members/
Instead, it redirects to the home page. That makes it impossible to give specifics.
Here is an example:
article_title = response.xpath("//td[@id='HpWelcome']/h2/text()").extract()
That parses "Florida Association of Licensed Investigators (FALI)" from their homepage. You can get browser plugins to help you figure out XPaths; XPath Helper on Chrome makes it easy.
That said, go through the tutorials posted above, because I'm sure you are going to have more questions, and broad questions like this aren't taken well on Stack Overflow.
As shark3y states in his answer, the start_url gets redirected to the main page.
If you have read the documentation, you know that Scrapy starts scraping from the start_url and does not know what you want to achieve.
In your case you need to start from http://www.fali.org/search/newsearch.asp, which returns the search results for all members. Now you can set up a Rule to go through the result list, call a parse_detail method for every member found, and follow the links through the result pagination.
In the parse_detail method you can go through the member's page and extract all the information you need. I guess you do not need the whole page, as in the example in your question, because that would generate a lot of data on your computer, and in the end you would have to parse it anyway.
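A rough sketch of that CrawlSpider setup; the link-extractor patterns and the field selectors are guesses for illustration:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class FaliMembersSpider(CrawlSpider):
    name = 'fali_members'
    allowed_domains = ['fali.org']
    start_urls = ['http://www.fali.org/search/newsearch.asp']

    rules = (
        # Follow the result pagination.
        Rule(LinkExtractor(allow=r'newsearch\.asp')),
        # Hand every member profile to parse_detail
        # (the '/members/' pattern is a guess at the URL structure).
        Rule(LinkExtractor(allow=r'/members/'), callback='parse_detail'),
    )

    def parse_detail(self, response):
        yield {
            # Placeholder selectors; adjust them to the member page layout.
            'name': response.xpath("//h1/text()").get(),
            'phone': response.xpath("//td[contains(., 'Phone')]/following-sibling::td/text()").get(),
        }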
