Scrapy - How to write to a custom FEED_URI - python

I'm new to Scrapy and I would like to write backups of the scraped HTML to S3. I found that by using the following, I could write a particular scrape's HTML:
settings.py
ITEM_PIPELINE = {
    'scrapy.pipelines.files.S3FilesStore': 1
}
AWS_ACCESS_KEY_ID = os.environ['S3_MAIN_KEY']
AWS_SECRET_ACCESS_KEY = os.environ['S3_MAIN_SECRET']
FEED_FORMAT = "json"
FEED_URI = f's3://bucket_name/%(name)s/%(today)s.html'
And then in my scraper file:
def parse(self, response):
    yield {'body': str(response.body, 'utf-8')}
However, I would like to write to a key that includes the url as a subfolder, for example:
FEED_URI=f's3://bucket_name/%(name)s/%(url)s/%(today)s.html'
How can I dynamically grab the URL for the FEED_URI? I'm assuming that in
def start_requests(self):
    urls = [
        'http://www.example.com',
        'http://www.example_1.com',
        'http://www.example_2.com',
    ]
I have multiple URLs. Also, is there any way to write the raw HTML file, not nested in JSON? Thanks.

Feed exports are not meant to export to a different file per item.
For that, write an item pipeline instead.
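A minimal sketch of such a pipeline, assuming the spider yields items with a 'url' key (response.url) and a 'body' key (the raw response.body bytes). The names RawHtmlPipeline, key_for and the html_backups folder are hypothetical, and the sketch writes to the local filesystem; an S3 upload would go where the file is written.

```python
import os
from datetime import date
from urllib.parse import urlparse


def key_for(spider_name, url, today=None):
    """Build a key like '<spider>/<host>/<date>.html' from the page URL."""
    today = today or date.today().isoformat()
    host = urlparse(url).netloc.replace(":", "_")
    return f"{spider_name}/{host}/{today}.html"


class RawHtmlPipeline:
    """Hypothetical pipeline: writes each item's raw HTML to its own file."""

    base_dir = "html_backups"  # assumed local folder, not part of the question

    def process_item(self, item, spider):
        path = os.path.join(self.base_dir, key_for(spider.name, item["url"]))
        os.makedirs(os.path.dirname(path), exist_ok=True)
        # item["body"] is expected to be bytes, so the HTML is stored as-is
        # rather than nested in JSON.
        with open(path, "wb") as f:
            f.write(item["body"])
        return item
```

The pipeline would be enabled through ITEM_PIPELINES in settings.py, and parse would yield {'url': response.url, 'body': response.body}.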

Related

Scrape multiple links from a json file

I am trying to scrape multiple links that I previously scraped and saved in a JSON file.
This works so far, but I don't want to scrape just that one URL; I want to scrape all of the URLs in my JSON file.
import scrapy
import json
class RatingSpider(scrapy.Spider):
    name = "rating"
    def start_requests(self):
        urls = [
            'https://www.darkpattern.games/game/3478/0/ragnarok-m-eternal-love-rom-.html'
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)
    def parse(self, response):
        for rating in response.css('div.score_box'):
            yield {
                'reported': rating.css('div.score_heading *::text').extract()
            }
The JSON file looks like this:
[
{
"title": [
"\n\t\t\t\t\t\t",
"Ragnarok M: Eternal Love(ROM)",
"\n\t\t\t\t\t\t",
"\t\t\t\t\t\t",
"The classic adventure returns",
"\n\t\t\t\t\t"
],
"link": [
"/game/3478/0/ragnarok-m-eternal-love-rom-.html"
],
"rating": [
"\n\t\t\t\t\t\t",
"\n\t\t\t\t\t\t",
"-4.68",
"\n\t\t\t\t\t"
]
}
]
Any suggestions on how to do this?
I don't see where your example reads from your JSON file. You would need to do something like this:
with open("your json file", "r") as f:
    jsonlist = json.load(f)
for entry in jsonlist:
    url = entry["link"][0]
    # do something with url: run a request, store it in a list, etc.
Also, your sample JSON contains a relative URL, so I assume the rest of the file is the same and the base URL is https://www.darkpattern.games. You would need to join the base URL and each relative URL before running the requests.
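Put together, the loading and joining steps could look like the sketch below. The helper name load_links is hypothetical, and the base URL is the one assumed in the answer above.

```python
import json
from urllib.parse import urljoin

BASE_URL = "https://www.darkpattern.games"  # base URL assumed from the sample link


def load_links(path, base_url=BASE_URL):
    """Return absolute URLs built from the first 'link' entry of each record."""
    with open(path, encoding="utf-8") as f:
        entries = json.load(f)
    # Skip records without a link; urljoin turns '/game/...' into a full URL.
    return [urljoin(base_url, entry["link"][0]) for entry in entries if entry.get("link")]
```

In the spider, start_requests could then loop over load_links("your json file") and yield scrapy.Request(url=url, callback=self.parse) for each URL.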

Can't get rid of a problematic page even when using rotation of proxies within scrapy

I've created a script using Scrapy, with proxy rotation implemented in it, to parse the address from a few hundred similar links like this. I supply those links to the script from a CSV file.
The script does fine until it encounters a response URL like this: https://www.bcassessment.ca//Property/UsageValidation. Once the script starts getting that link, it can't get past it. FYI, I'm using a lead_link key in meta so that retries use the original link instead of the redirected one, so I should be able to bypass that barrier.
It doesn't happen when I use proxies with the requests library. To be clearer: with the requests library, the script does encounter the /Property/UsageValidation page but gets past it successfully after a few retries.
The spider is like:
import csv
import scrapy
from scrapy.crawler import CrawlerProcess
class mySpider(scrapy.Spider):
    name = "myspider"
    custom_settings = {
        'DOWNLOADER_MIDDLEWARES': {
            'stackoverflow_spider.middlewares.ProxiesMiddleware': 100,
        }
    }
    def start_requests(self):
        with open("output_main.csv","r") as f:
            reader = csv.DictReader(f)
            for item in list(reader):
                lead_link = item['link']
                yield scrapy.Request(lead_link,self.parse,meta={"lead_link":lead_link,"download_timeout":20}, dont_filter=True)
    def parse(self,response):
        address = response.css("h1#mainaddresstitle::text").get()
        print(response.meta['proxy'],address)
if __name__ == "__main__":
    c = CrawlerProcess({
        'USER_AGENT':'Mozilla/5.0',
        'LOG_LEVEL':'ERROR',
    })
    c.crawl(mySpider)
    c.start()
How can I make the script avoid that page?
PS I've attached a few of the links in a text file in case anyone wants to give it a try.
To make the proxy implementation in a Scrapy app session-safe, you need to add a cookiejar meta key in the place where you assign the proxy to request.meta, like this:
....
yield scrapy.Request(url=link, meta = {"proxy":address, "cookiejar":address})
In this case Scrapy's CookiesMiddleware will keep a separate cookie session for each proxy.
Related specifics of Scrapy's proxy implementation are mentioned in this answer.
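Since the question's ProxiesMiddleware is not shown, here is a hedged sketch of what a downloader middleware applying this idea might look like. The class name and the proxy addresses are placeholders, not taken from the question.

```python
import random


class ProxyWithCookiejarMiddleware:
    """Hypothetical downloader middleware: one cookie session per proxy."""

    # Placeholder addresses; a real project would load its own proxy list.
    PROXIES = ["http://127.0.0.1:8001", "http://127.0.0.1:8002"]

    def process_request(self, request, spider):
        proxy = random.choice(self.PROXIES)
        request.meta["proxy"] = proxy
        # Using the proxy address as the cookiejar key keeps cookies received
        # through one proxy from being sent through another.
        request.meta["cookiejar"] = proxy
        return None  # let Scrapy continue processing the request
```

With a priority of 100, as in the question's DOWNLOADER_MIDDLEWARES, this process_request runs before Scrapy's built-in CookiesMiddleware, so the cookiejar key is already set when cookies are attached.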

scrape an api result page with scrapy

I have this URL whose response contains some JSON data.
https://www.tripadvisor.com/TypeAheadJson?action=API&types=geo%2Cnbrhd%2Chotel%2Ctheme_park&legacy_format=true&urlList=true&strictParent=true&query=sadaf%20dubai%20hotel&max=6&name_depth=3&interleaved=true&scoreThreshold=0.5&strictAnd=false&typeahead1_5=true&disableMaxGroupSize=true&geoBoostFix=true&neighborhood_geos=true&details=true&link_type=hotel%2Cvr%2Ceat%2Cattr&rescue=true&uiOrigin=trip_search_Hotels&source=trip_search_Hotels&startTime=1516800919604&searchSessionId=BA939B3D93510DABB510328CBF3353131516800881576ssid&nearPages=true
Every time I paste this URL into the browser with different queries, I get a nice JSON result. But in Scrapy or the Scrapy shell, I don't get any results. This is my Scrapy spider class:
link = "https://www.tripadvisor.com/TypeAheadJson?action=API&types=geo%2Cnbrhd%2Chotel%2Ctheme_park&legacy_format=true&urlList=true&strictParent=true&query={}%20dubai%20hotel&max=6&name_depth=3&interleaved=true&scoreThreshold=0.5&strictAnd=false&typeahead1_5=true&disableMaxGroupSize=true&geoBoostFix=true&neighborhood_geos=true&details=true&link_type=hotel%2Cvr%2Ceat%2Cattr&rescue=true&uiOrigin=trip_search_Hotels&source=trip_search_Hotels&startTime=1516800919604&searchSessionId=BA939B3D93510DABB510328CBF3353131516800881576ssid&nearPages=true"
def start_requests(self):
    files = [f for f in listdir('results/') if isfile(join('results/', f))]
    for file in files:
        with open('results/' + file, 'r', encoding="utf8") as tour_info:
            tour = json.load(tour_info)
            for hotel in tour["hotels"]:
                yield scrapy.Request(self.link.format(hotel))
name = 'tripadvisor'
allowed_domains = ['tripadvisor.com']
def parse(self, response):
    print(response.body)
For this code, in the Scrapy shell, I get this result:
b'{"normalized":{"query":""},"query":{},"results":[],"partial_content":false}'
When running the spider from the command line, I first got the Forbidden by robots.txt error for every URL. I set Scrapy's ROBOTSTXT_OBEY to False so that it does not obey this file. Now I get [] for every request, but I should get a JSON object like this:
[
{
"urls":[
{
"url_type":"hotel",
"name":"Sadaf Hotel, Dubai, United Arab Emirates",
"type":"HOTEL",
"url":"\/Hotel_Review-g295424-d633008-Reviews-Sadaf_Hotel-Dubai_Emirate_of_Dubai.html"
}
],
.
.
.
Try removing the searchSessionId parameter from the URL, and maybe check how "unfriendly" your settings.py is. (Also see this blog)
But it could be way easier to use Wget, like wget 'https://www.tripadvisor.com/TypeAheadJson?action=API&types=geo%2Cnbrhd%2Chotel%2Ctheme_park&legacy_format=true&urlList=true&strictParent=true&query={}%20dubai%20hotel&max=6&name_depth=3&interleaved=true&scoreThreshold=0.5&strictAnd=false&typeahead1_5=true&disableMaxGroupSize=true&geoBoostFix=true&neighborhood_geos=true&details=true&link_type=hotel%2Cvr%2Ceat%2Cattr&rescue=true&uiOrigin=trip_search_Hotels&source=trip_search_Hotels&startTime=1516800919604&nearPages=true' -O results.json
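If you stay with Scrapy, dropping the session parameter can be done programmatically rather than by hand. The sketch below uses only the standard library; the helper name drop_query_param is hypothetical.

```python
from urllib.parse import parse_qsl, quote, urlencode, urlsplit, urlunsplit


def drop_query_param(url, name):
    """Return url with every occurrence of the query parameter `name` removed."""
    parts = urlsplit(url)
    pairs = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != name
    ]
    # quote_via=quote keeps spaces encoded as %20 instead of '+'.
    return urlunsplit(parts._replace(query=urlencode(pairs, quote_via=quote)))
```

Apply it after self.link.format(hotel), not to the template itself, because the literal {} placeholder would otherwise be percent-encoded.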

Export csv file from scrapy (not via command line)

I successfully tried to export my items into a csv file from the command line like:
scrapy crawl spiderName -o filename.csv
My question is:
What is the easiest way to do the same in code? I need this because I extract the filename from another file.
The end scenario should be that I call
scrapy crawl spiderName
and it writes the items into filename.csv
Why not use an item pipeline?
WriteToCsv.py
import csv
from YOUR_PROJECT_NAME_HERE import settings
def write_to_csv(item):
    # Open in append mode and close the file again after each row.
    with open(settings.csv_file_path, 'a', newline='') as f:
        writer = csv.writer(f, lineterminator='\n')
        writer.writerow([item[key] for key in item.keys()])
class WriteToCsv(object):
    def process_item(self, item, spider):
        write_to_csv(item)
        return item
settings.py
ITEM_PIPELINES = { 'project.pipelines_path.WriteToCsv.WriteToCsv' : A_NUMBER_HIGHER_THAN_ALL_OTHER_PIPELINES}
csv_file_path = PATH_TO_CSV
If you want items to be written to a separate CSV file for each spider, you could give your spider a CSV_PATH field. Then, in your pipeline, use the spider's field instead of the path from settings.
This works; I tested it in my project.
HTH
http://doc.scrapy.org/en/latest/topics/item-pipeline.html
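The per-spider variant described above could look roughly like this. It is a sketch, not the answerer's tested code: the class name is hypothetical, CSV_PATH is the spider attribute suggested in the answer, and the fallback to '<spider name>.csv' is an assumption.

```python
import csv


class PerSpiderCsvPipeline:
    """Hypothetical pipeline: one CSV file per spider, path taken from the spider."""

    def open_spider(self, spider):
        # Fall back to '<spider name>.csv' when the spider defines no CSV_PATH.
        path = getattr(spider, "CSV_PATH", f"{spider.name}.csv")
        self.file = open(path, "a", newline="")
        self.writer = csv.writer(self.file)

    def process_item(self, item, spider):
        # One row per item, columns in the item's own key order.
        self.writer.writerow(list(item.values()))
        return item

    def close_spider(self, spider):
        self.file.close()
```

Opening the file once in open_spider and closing it in close_spider avoids reopening it for every item.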
There is an updated way to save your file in Scrapy, which is to use "FEEDS":
class mySpider(scrapy.Spider):
    name = "myProject"
    custom_settings = {
        "FEEDS":{"fileName.csv":{"format":"csv"}},
    }
That's what Feed Exports are for:
http://doc.scrapy.org/en/latest/topics/feed-exports.html
One of the most frequently required features when implementing scrapers is being able to store the scraped data properly and, quite often, that means generating an “export file” with the scraped data (commonly called “export feed”) to be consumed by other systems.
Scrapy provides this functionality out of the box with the Feed Exports, which allows you to generate a feed with the scraped items, using multiple serialization formats and storage backends.
The up-to-date answer is:
Use the built-in exporter. You can set the filename as the key. The config may look like:
filename = 'export'
class mySpider(scrapy.Spider):
    custom_settings = {
        'FEEDS': {
            f'{filename}.csv': {
                'format': 'csv',
                'overwrite': True
            }
        }
    }
Documentation: https://docs.scrapy.org/en/latest/topics/feed-exports.html#std-setting-FEEDS

scraping the file with html saved in local system

For example, I have a site, "www.example.com".
I want to scrape the HTML of this site after saving it to my local system.
So for testing I saved that page on my desktop as example.html.
Then I wrote the spider code for it as below:
class ExampleSpider(BaseSpider):
    name = "example"
    start_urls = ["example.html"]
    def parse(self, response):
        print response
        hxs = HtmlXPathSelector(response)
But when I run the above code I get the error below:
ValueError: Missing scheme in request url: example.html
Ultimately, my intention is to scrape the example.html file, which contains the HTML of www.example.com saved on my local system.
Can anyone suggest how to specify that example.html file in start_urls?
Thanks in advance.
You can crawl a local file using a URL of the following form:
file:///path/to/file.html
You can use the HttpCacheMiddleware, which lets you do a spider run from the cache. The documentation for the HttpCacheMiddleware settings is located here.
Basically, adding the following settings to your settings.py will make it work:
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 0 # Set to 0 to never expire
This, however, requires an initial spider run against the web to populate the cache.
In Scrapy, you can scrape a local file using:
class ExampleSpider(BaseSpider):
    name = "example"
    start_urls = ["file:///path_of_directory/example.html"]
    def parse(self, response):
        print response
        hxs = HtmlXPathSelector(response)
I suggest you check it using scrapy shell 'file:///path_of_directory/example.html'
Just to share the way that I like to do this scraping with local files:
import scrapy
import os
LOCAL_FILENAME = 'example.html'
LOCAL_FOLDER = 'html_files'
BASE_DIR = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = [
        f"file://{BASE_DIR}/{LOCAL_FOLDER}/{LOCAL_FILENAME}"
    ]
I'm using f-strings (python 3.6+)(https://www.python.org/dev/peps/pep-0498/), but you can change with %-formatting or str.format() as you prefer.
scrapy shell "file:E:\folder\to\your\script\Scrapy\teste1\teste1.html"
This works for me today on Windows 10.
I have to put the full path without the ////.
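To avoid hand-building the file URL, which differs between POSIX and Windows paths, the standard library can generate it. A small sketch; the helper name file_url is hypothetical.

```python
from pathlib import Path


def file_url(path):
    """Return an absolute file:// URL for a local path."""
    # resolve() makes the path absolute; as_uri() requires an absolute path
    # and handles drive letters and percent-encoding.
    return Path(path).resolve().as_uri()
```

The result can be used directly in start_urls, for example start_urls = [file_url("example.html")].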
You can simply do
def start_requests(self):
    yield Request(url='file:///path_of_directory/example.html')
If you view the source code of Scrapy's Request (for example, on GitHub), you can see that Scrapy sends a request to an HTTP server and gets the needed page in the server's response. Your filesystem is not an HTTP server. For testing purposes with Scrapy, you can set up a local HTTP server and then give Scrapy URLs like
http://127.0.0.1/example.html
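If you do prefer to go through HTTP, a local test server can be started from Python's standard library. This is a sketch under assumptions: the helper name serve_directory is hypothetical, and it binds to 127.0.0.1 on a port chosen by the operating system unless one is given.

```python
import functools
import threading
from http.server import HTTPServer, SimpleHTTPRequestHandler


def serve_directory(directory, port=0):
    """Serve `directory` over HTTP on 127.0.0.1 in a background thread."""
    handler = functools.partial(SimpleHTTPRequestHandler, directory=directory)
    server = HTTPServer(("127.0.0.1", port), handler)  # port=0 picks a free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

server.server_address[1] gives the port actually in use, and server.shutdown() stops the server once the crawl is finished.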
