I have this URL whose response body contains some JSON data:
https://www.tripadvisor.com/TypeAheadJson?action=API&types=geo%2Cnbrhd%2Chotel%2Ctheme_park&legacy_format=true&urlList=true&strictParent=true&query=sadaf%20dubai%20hotel&max=6&name_depth=3&interleaved=true&scoreThreshold=0.5&strictAnd=false&typeahead1_5=true&disableMaxGroupSize=true&geoBoostFix=true&neighborhood_geos=true&details=true&link_type=hotel%2Cvr%2Ceat%2Cattr&rescue=true&uiOrigin=trip_search_Hotels&source=trip_search_Hotels&startTime=1516800919604&searchSessionId=BA939B3D93510DABB510328CBF3353131516800881576ssid&nearPages=true
Every time I paste this URL into the browser with a different query, I get a nice JSON result. But in Scrapy, or the Scrapy shell, I don't get any result. This is my Scrapy spider class:
import json
import scrapy
from os import listdir
from os.path import isfile, join

class TripadvisorSpider(scrapy.Spider):
    name = 'tripadvisor'
    allowed_domains = ['tripadvisor.com']
    link = "https://www.tripadvisor.com/TypeAheadJson?action=API&types=geo%2Cnbrhd%2Chotel%2Ctheme_park&legacy_format=true&urlList=true&strictParent=true&query={}%20dubai%20hotel&max=6&name_depth=3&interleaved=true&scoreThreshold=0.5&strictAnd=false&typeahead1_5=true&disableMaxGroupSize=true&geoBoostFix=true&neighborhood_geos=true&details=true&link_type=hotel%2Cvr%2Ceat%2Cattr&rescue=true&uiOrigin=trip_search_Hotels&source=trip_search_Hotels&startTime=1516800919604&searchSessionId=BA939B3D93510DABB510328CBF3353131516800881576ssid&nearPages=true"

    def start_requests(self):
        files = [f for f in listdir('results/') if isfile(join('results/', f))]
        for file in files:
            with open('results/' + file, 'r', encoding="utf8") as tour_info:
                tour = json.load(tour_info)
                for hotel in tour["hotels"]:
                    yield scrapy.Request(self.link.format(hotel))

    def parse(self, response):
        print(response.body)
For this code, in the Scrapy shell, I get this result:
b'{"normalized":{"query":""},"query":{},"results":[],"partial_content":false}'
On the Scrapy command line, running the spider, I first got a "Forbidden by robots.txt" error for every URL. I set ROBOTSTXT_OBEY to False so Scrapy does not obey that file. Now I get [] for every request, but I should get a JSON object like this:
[
{
"urls":[
{
"url_type":"hotel",
"name":"Sadaf Hotel, Dubai, United Arab Emirates",
"type":"HOTEL",
"url":"\/Hotel_Review-g295424-d633008-Reviews-Sadaf_Hotel-Dubai_Emirate_of_Dubai.html"
}
],
.
.
.
Try removing the sessionID from the URL and maybe check how "unfriendly" your settings.py is. (Also see this blog)
But it could be much easier to use wget, like:
wget 'https://www.tripadvisor.com/TypeAheadJson?action=API&types=geo%2Cnbrhd%2Chotel%2Ctheme_park&legacy_format=true&urlList=true&strictParent=true&query={}%20dubai%20hotel&max=6&name_depth=3&interleaved=true&scoreThreshold=0.5&strictAnd=false&typeahead1_5=true&disableMaxGroupSize=true&geoBoostFix=true&neighborhood_geos=true&details=true&link_type=hotel%2Cvr%2Ceat%2Cattr&rescue=true&uiOrigin=trip_search_Hotels&source=trip_search_Hotels&startTime=1516800919604&nearPages=true' -O results.json
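If you want to stay in Scrapy rather than switch to wget, a minimal sketch of that suggestion might look like this; dropping the startTime and searchSessionId parameters and sending a browser-like User-Agent are assumptions, not a guaranteed fix:

import scrapy

class TypeAheadSpider(scrapy.Spider):
    name = "tripadvisor_typeahead"

    custom_settings = {
        "ROBOTSTXT_OBEY": False,
        # a browser-like User-Agent is usually less "unfriendly" than the default
        "USER_AGENT": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                       "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0 Safari/537.36"),
    }

    # same endpoint as in the question, but without the session-bound
    # startTime / searchSessionId parameters
    link = ("https://www.tripadvisor.com/TypeAheadJson?action=API"
            "&types=geo%2Cnbrhd%2Chotel%2Ctheme_park&legacy_format=true"
            "&urlList=true&query={}%20dubai%20hotel&max=6&details=true")

    def start_requests(self):
        for hotel in ["sadaf"]:  # placeholder query
            yield scrapy.Request(self.link.format(hotel))

    def parse(self, response):
        self.logger.info(response.text)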
Related
I have created a crawler using Scrapy. The crawler crawls the website and fetches the URLs.
Technology used: Python, Scrapy.
Issue: I am getting duplicate URLs.
What I need the output to be:
I want the crawler to crawl the website and fetch URLs, but not crawl the duplicate URLs.
Sample Code:
I have added this code to my settings.py file.
DUPEFILTER_CLASS ='scrapy.dupefilter.RFPDupeFilter'
When I run the spider, it says the module was not found.
import scrapy
import os
import scrapy.dupefilters

class MySpider(scrapy.Spider):
    name = 'feed_exporter_test'

    # this is equivalent to what you would set in settings.py file
    custom_settings = {
        'FEED_FORMAT': 'csv',
        'FEED_URI': 'inputLinks2.csv'
    }

    filePath = 'inputLinks2.csv'
    if os.path.exists(filePath):
        os.remove(filePath)
    else:
        print("Can not delete the file as it doesn't exist")

    start_urls = ['https://www.mytravelexp.com/']

    def parse(self, response):
        titles = response.xpath("//a/@href").extract()
        for title in titles:
            yield {'title': title}

    def __getid(self, url):
        mm = url.split("&refer")[0]  # or something like that
        return mm

    def request_seen(self, request):
        fp = self.__getid(request.url)
        if fp in self.fingerprints:
            return True
        self.fingerprints.add(fp)
        if self.file:
            self.file.write(fp + os.linesep)
Please help!!
Scrapy filters duplicate requests by default.
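If you do need custom de-duplication (for example, ignoring everything after &refer in the URL), a minimal sketch of a custom dupefilter could look like this; note that the module is scrapy.dupefilters (with an s), and myproject.dupefilters is a placeholder for your own project path:

# myproject/dupefilters.py
# settings.py would then contain:
# DUPEFILTER_CLASS = 'myproject.dupefilters.URLDupeFilter'
from scrapy.dupefilters import RFPDupeFilter

class URLDupeFilter(RFPDupeFilter):
    """Treats two URLs as duplicates if they only differ after '&refer'."""

    def _strip_refer(self, url):
        return url.split("&refer")[0]

    def request_seen(self, request):
        fp = self._strip_refer(request.url)
        if fp in self.fingerprints:
            return True
        self.fingerprints.add(fp)
        if self.file:
            self.file.write(fp + '\n')
        return False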
I've created a script using Scrapy, with proxy rotation implemented in it, to parse the address from a few hundred similar links like this one. I supply those links from a CSV file within the script.
The script works fine until it encounters a response URL like https://www.bcassessment.ca//Property/UsageValidation; once the script starts getting that page, it can't get past it. FYI, I'm using meta properties containing lead_link so the retry uses the original link instead of the redirected one, so I should be able to bypass that barrier.
It doesn't happen when I use proxies with the requests library. To be clearer: while using the requests library, the script does encounter the /Property/UsageValidation page, but bypasses it successfully after a few retries.
The spider is like:
import csv
import scrapy
from scrapy.crawler import CrawlerProcess

class mySpider(scrapy.Spider):
    name = "myspider"

    custom_settings = {
        'DOWNLOADER_MIDDLEWARES': {
            'stackoverflow_spider.middlewares.ProxiesMiddleware': 100,
        }
    }

    def start_requests(self):
        with open("output_main.csv", "r") as f:
            reader = csv.DictReader(f)
            for item in list(reader):
                lead_link = item['link']
                yield scrapy.Request(lead_link, self.parse, meta={"lead_link": lead_link, "download_timeout": 20}, dont_filter=True)

    def parse(self, response):
        address = response.css("h1#mainaddresstitle::text").get()
        print(response.meta['proxy'], address)

if __name__ == "__main__":
    c = CrawlerProcess({
        'USER_AGENT': 'Mozilla/5.0',
        'LOG_LEVEL': 'ERROR',
    })
    c.crawl(mySpider)
    c.start()
How can I keep the script from getting stuck on that page?
PS: I've attached a few of the links in a text file in case anyone wants to give it a try.
To make the proxy implementation session-safe for a Scrapy app, you need to add an additional cookiejar meta key wherever you assign the proxy to request.meta, like this:
....
yield scrapy.Request(url=link, meta={"proxy": address, "cookiejar": address})
In this case Scrapy's CookiesMiddleware will create an additional cookie session for each proxy.
Related specifics of Scrapy's proxy implementation are mentioned in this answer.
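Combining that with the lead_link idea from the question, a minimal sketch might look like the following; the proxy-assigning middleware is assumed to stay as in the question, and the retry-on-validation check in parse is an assumption about how to recover, not documented site behaviour:

import csv
import scrapy

class MySpider(scrapy.Spider):
    name = "myspider"

    custom_settings = {
        'DOWNLOADER_MIDDLEWARES': {
            'stackoverflow_spider.middlewares.ProxiesMiddleware': 100,
        }
    }

    def start_requests(self):
        with open("output_main.csv", "r") as f:
            for item in csv.DictReader(f):
                lead_link = item['link']
                # one cookie jar per original link, so a "burned" session
                # from one proxy does not leak into the next attempt
                yield scrapy.Request(
                    lead_link,
                    callback=self.parse,
                    meta={"lead_link": lead_link, "cookiejar": lead_link},
                    dont_filter=True,
                )

    def parse(self, response):
        if "/Property/UsageValidation" in response.url:
            # retry the original link with a fresh cookie jar instead of
            # following the redirected validation page
            lead_link = response.meta["lead_link"]
            yield scrapy.Request(
                lead_link,
                callback=self.parse,
                meta={"lead_link": lead_link,
                      "cookiejar": lead_link + "-retry"},
                dont_filter=True,
            )
            return
        yield {"address": response.css("h1#mainaddresstitle::text").get()}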
I'm new to Scrapy and I would like to write backups of the HTML I scrape to S3. I found that by using the following, I could write a particular scrape's HTML:
settings.py
import os

ITEM_PIPELINE = {
    'scrapy.pipelines.files.S3FilesStore': 1
}

AWS_ACCESS_KEY_ID = os.environ['S3_MAIN_KEY']
AWS_SECRET_ACCESS_KEY = os.environ['S3_MAIN_SECRET']
FEED_FORMAT = "json"
FEED_URI = f's3://bucket_name/%(name)s/%(today)s.html'
And then in my scraper file:
def parse(self, response):
    yield {'body': str(response.body, 'utf-8')}
However, I would like to write to a key that includes the url as a subfolder, for example:
FEED_URI=f's3://bucket_name/%(name)s/%(url)s/%(today)s.html'
How can I dynamically grab the URL for the FEED_URI? I'm assuming that in
def start_requests(self):
    urls = [
        'http://www.example.com',
        'http://www.example_1.com',
        'http://www.example_2.com',
    ]
I have multiple URLs. Also, is there any way to write the raw HTML file, not nested in JSON? Thanks.
Feed exports are not meant to export to a different file per item.
For that, write an item pipeline instead.
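A minimal sketch of such a pipeline, assuming boto3 and the same S3_MAIN_KEY / S3_MAIN_SECRET environment variables as in the question (the bucket name and key layout are placeholders):

# pipelines.py
import os
from datetime import date
from urllib.parse import quote

import boto3

class HtmlToS3Pipeline:
    """Uploads the raw HTML of each item to S3, one object per page URL."""

    def open_spider(self, spider):
        self.s3 = boto3.client(
            "s3",
            aws_access_key_id=os.environ["S3_MAIN_KEY"],
            aws_secret_access_key=os.environ["S3_MAIN_SECRET"],
        )

    def process_item(self, item, spider):
        # e.g. <spider name>/<url-encoded page url>/<date>.html
        key = "{}/{}/{}.html".format(
            spider.name, quote(item["url"], safe=""), date.today()
        )
        self.s3.put_object(Bucket="bucket_name", Key=key,
                           Body=item["body"].encode("utf-8"))
        return item

The spider would then yield something like {'url': response.url, 'body': response.text}, and the pipeline would be enabled through the ITEM_PIPELINES setting instead of the feed export settings.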
I am working with Scrapy to scrape a website, and I want to extract only those items which have not been scraped in a previous run.
I am trying it on the https://www.ndtv.com/top-stories website, to extract only the first headline if it has been updated.
Below is my code:
import scrapy
from selenium import webdriver
from w3lib.url import url_query_parameter

class QuotesSpider(scrapy.Spider):
    name = "test"
    start_urls = [
        'https://www.ndtv.com/top-stories',
    ]

    def parse(self, response):
        print('testing')
        print(response.url)
        yield {
            'heading': response.css('div.nstory_header a::text').extract_first(),
        }
DOWNLOADER_MIDDLEWARES = {
    'scrapy_crawl_once.CrawlOnceMiddleware': 100,
}

SPIDER_MIDDLEWARES = {
    #'inc_crawling.middlewares.IncCrawlingSpiderMiddleware': 543,
    'scrapy.contrib.spidermiddleware.referer.RefererMiddleware': True,
    'scrapy_deltafetch.DeltaFetch': 100,
    'scrapy_crawl_once.CrawlOnceMiddleware': 100,
    'scrapylib.deltafetch.DeltaFetch': 100,
    'inc_crawling.middlewares.deltafetch.DeltaFetch': 100,
}

COOKIES_ENABLED = True
COOKIES_DEBUG = True
DELTAFETCH_ENABLED = True
DELTAFETCH_DIR = '/home/administrator/apps/inc_crawling'
DOTSCRAPY_ENABLED = True
I have added the above settings to my settings.py file.
I am running the spider using the "scrapy crawl test -o test.json" command, and after each run both the .db file and test.json get updated.
So my expectation is that the .db file should only get updated when the first headline has changed.
Kindly help me if there is a better approach to extract only the updated headline.
A good way to implement this would be to override the DUPEFILTER_CLASS to check your database before doing the actual requests.
Scrapy uses a dupefilter class to avoid requesting the same URL twice, but by default it only works within a single spider run.
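A minimal sketch of that idea, persisting seen request URLs across runs in a local SQLite file (the file name and the myproject.dupefilters module path are placeholders):

# myproject/dupefilters.py
# settings.py: DUPEFILTER_CLASS = 'myproject.dupefilters.PersistentDupeFilter'
import sqlite3

from scrapy.dupefilters import RFPDupeFilter

class PersistentDupeFilter(RFPDupeFilter):
    """Remembers seen request URLs across runs in an SQLite database."""

    def __init__(self, path=None, debug=False):
        super().__init__(path, debug)
        self.db = sqlite3.connect("seen_requests.db")
        self.db.execute("CREATE TABLE IF NOT EXISTS seen (url TEXT PRIMARY KEY)")

    def request_seen(self, request):
        if self.db.execute("SELECT 1 FROM seen WHERE url = ?",
                           (request.url,)).fetchone():
            return True
        self.db.execute("INSERT INTO seen (url) VALUES (?)", (request.url,))
        self.db.commit()
        return False

    def close(self, reason):
        self.db.close()
        super().close(reason)

Note that this only skips URLs that have already been requested; deciding whether the headline text itself has changed would still need to happen in the spider or in an item pipeline.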
For example, I have a site "www.example.com".
Actually, I want to scrape the HTML of this site by saving it to my local system.
So for testing, I saved that page on my desktop as example.html.
Now I have written the spider code for this as below:
class ExampleSpider(BaseSpider):
    name = "example"
    start_urls = ["example.html"]

    def parse(self, response):
        print response
        hxs = HtmlXPathSelector(response)
But when I run the above code, I get this error:
ValueError: Missing scheme in request url: example.html
Finally, my intention is to scrape the example.html file, which contains the www.example.com HTML code, saved on my local system.
Can anyone suggest how to assign that example.html file in start_urls?
Thanks in advance
You can crawl a local file using an url of the following form:
file:///path/to/file.html
You can use the HttpCacheMiddleware, which will give you the ability to do a spider run from cache. The documentation for the HttpCacheMiddleware settings is located here.
Basically, adding the following settings to your settings.py will make it work:
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 0 # Set to 0 to never expire
This, however, requires an initial spider run against the live site to populate the cache.
In Scrapy, you can scrape a local file using:
class ExampleSpider(BaseSpider):
    name = "example"
    start_urls = ["file:///path_of_directory/example.html"]

    def parse(self, response):
        print response
        hxs = HtmlXPathSelector(response)
I suggest you check it using scrapy shell 'file:///path_of_directory/example.html'
Just to share the way that I like to do this scraping with local files:
import scrapy
import os

LOCAL_FILENAME = 'example.html'
LOCAL_FOLDER = 'html_files'
BASE_DIR = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))

class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = [
        f"file://{BASE_DIR}/{LOCAL_FOLDER}/{LOCAL_FILENAME}"
    ]
I'm using f-strings (Python 3.6+, https://www.python.org/dev/peps/pep-0498/), but you can switch to %-formatting or str.format() if you prefer.
scrapy shell "file:E:\folder\to\your\script\Scrapy\teste1\teste1.html"
This works for me today on Windows 10. I have to put the full path, without the ////.
You can simply do:
from scrapy import Request

def start_requests(self):
    yield Request(url='file:///path_of_directory/example.html')
If you view the source code of scrapy's Request (for example on GitHub), you can see that Scrapy sends a request to an HTTP server and gets the needed page back in the response from the server. Your filesystem is not an HTTP server. For testing purposes with Scrapy, you must set up an HTTP server, and then you can assign URLs to Scrapy like:
http://127.0.0.1/example.html
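A minimal sketch of that approach, assuming example.html sits in the directory you serve and port 8000 is free:

# First serve the folder that contains example.html, e.g. from a terminal:
#   python -m http.server 8000
import scrapy

class LocalHttpSpider(scrapy.Spider):
    name = "local_http"
    # the built-in server above makes the file reachable over HTTP
    start_urls = ["http://127.0.0.1:8000/example.html"]

    def parse(self, response):
        # plain demonstration: log the page title
        self.logger.info(response.css("title::text").get())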