Scrape multiple links from a json file - python

I am trying to scrape multiple links that I have previously scraped and saved in a JSON file.
This works so far, but I don't want to scrape just that one URL; I want to scrape all of the URLs from my JSON file.
import scrapy
import json


class RatingSpider(scrapy.Spider):
    name = "rating"

    def start_requests(self):
        urls = [
            'https://www.darkpattern.games/game/3478/0/ragnarok-m-eternal-love-rom-.html'
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        for rating in response.css('div.score_box'):
            yield {
                'reported': rating.css('div.score_heading *::text').extract()
            }
The JSON file looks like this:
[
    {
        "title": [
            "\n\t\t\t\t\t\t",
            "Ragnarok M: Eternal Love(ROM)",
            "\n\t\t\t\t\t\t",
            "\t\t\t\t\t\t",
            "The classic adventure returns",
            "\n\t\t\t\t\t"
        ],
        "link": [
            "/game/3478/0/ragnarok-m-eternal-love-rom-.html"
        ],
        "rating": [
            "\n\t\t\t\t\t\t",
            "\n\t\t\t\t\t\t",
            "-4.68",
            "\n\t\t\t\t\t"
        ]
    }
]
Any suggestions on how to do this?

I don't see in your example where you are reading from your JSON file. You would need to do something like this:

with open("your json file", "r") as f:
    jsonlist = json.load(f)

for entry in jsonlist:
    url = entry["link"][0]
    # do something with url - run a request, store it in a list, etc.

Also, your sample JSON contains a relative URL, so I assume the rest of the file is the same and the base URL is https://www.darkpattern.games. You would need to concatenate the base URL - https://www.darkpattern.games - and the relative URLs prior to running the requests.
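Putting that together, a minimal sketch of the spider that reads the links from the JSON file and joins them against the base URL might look like this (the file name output.json is an assumption; use whatever file your first crawl actually produced):

import json
from urllib.parse import urljoin

import scrapy


class RatingSpider(scrapy.Spider):
    name = "rating"
    base_url = "https://www.darkpattern.games"

    def start_requests(self):
        # "output.json" is an assumed file name; point it at your saved JSON file
        with open("output.json", "r", encoding="utf8") as f:
            jsonlist = json.load(f)
        for entry in jsonlist:
            # each "link" entry is a relative path, so join it with the base URL
            yield scrapy.Request(url=urljoin(self.base_url, entry["link"][0]),
                                 callback=self.parse)

    def parse(self, response):
        for rating in response.css('div.score_box'):
            yield {
                'reported': rating.css('div.score_heading *::text').extract()
            }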

Related

How to export scraped data using FEEDS/FEED EXPORTS in Scrapy

I'm new to web scraping/Scrapy and Python.
Scrapy version: Scrapy 2.5.1
OS: Windows
IDE: PyCharm
I am trying to use the FEEDS option in Scrapy to automatically export the scraped data from a website into a file I can open in Excel.
I tried the solution from this Stack Overflow answer, but it didn't work, and I'm not sure what I'm doing wrong here. Am I missing something?
I also tried adding the same settings in my settings.py file, after commenting out custom_settings in my spider class, as per the example provided in the documentation: https://docs.scrapy.org/en/latest/topics/feed-exports.html?highlight=feed#feeds
For now I achieved my requirement using the spider_closed signal to write the data to CSV, by storing all the scraped item data in an array called result.
class SpiderFC(scrapy.Spider):
    name = "FC"
    start_urls = [
        url,
    ]
    custom_setting = {"FEEDS": {r"C:\Users\rreddy\PycharmProjects\fcdc\webscrp\outputfinal.csv": {"format": "csv", "overwrite": True}}}

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(SpiderFC, cls).from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.spider_closed, signal=signals.spider_closed)
        return spider

    def __init__(self, name=None):
        super().__init__(name)
        self.count = None

    def parse(self, response, **kwargs):
        # each item scraped from the parent page has links where the actual data
        # needs to be scraped, so I follow each link and scrape the data there
        yield response.follow(notice_href_follow, callback=self.parse_item,
                              meta={'item': item, 'index': index, 'next_page': next_page})

    def parse_item(self, response):
        # logic for the items to scrape goes here; they are saved to a temp list,
        # appended to the result array, and then the temp list is cleared
        result.append(it)  # result data is used at the end to write to csv
        item.clear()
        if next_page:
            yield next(self.follow_next(response, next_page))

    def follow_next(self, response, next_page):
        next_page_url = urljoin(url, next_page[0])
        yield response.follow(next_page_url, callback=self.parse)

The spider_closed signal handler:

def spider_closed(self, spider):
    with open(output_path, mode="a", newline='') as f:
        writer = csv.writer(f)
        for v in result:
            writer.writerow([v["city"]])
When all the data is scraped and all requests are complete, the spider_closed signal writes the data to a CSV. But I'm trying to avoid this extra logic and use Scrapy's built-in exporter instead; I'm just having trouble getting it to export the data.
Check your path. If you are on Windows, then provide the full path in custom_settings, e.g. as below:

custom_settings = {
    "FEEDS": {r"C:\Users\Name\Path\To\outputfinal.csv": {"format": "csv", "overwrite": True}}
}

If you are on Linux or macOS, then provide the path as below:

custom_settings = {
    "FEEDS": {r"/Path/to/folder/fcdc/webscrp/outputfinal.csv": {"format": "csv", "overwrite": True}}
}

Alternatively, provide a relative path as below, which will create the folder structure fcdc/webscrp/outputfinal.csv in the directory from which the spider is run.

custom_settings = {
    "FEEDS": {r"./fcdc/webscrp/outputfinal.csv": {"format": "csv", "overwrite": True}}
}
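If you would rather keep the export configuration out of the spider (the question also mentions trying this), the same FEEDS dictionary can live in settings.py. A minimal sketch, assuming the relative output path from above:

# settings.py - minimal sketch; remove or comment the spider's custom_settings
# so it does not take precedence over this project-level setting
FEEDS = {
    r"./fcdc/webscrp/outputfinal.csv": {
        "format": "csv",
        "overwrite": True,
    },
}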

Scrapy - How to write to a custom FEED_URI

I'm new to Scrapy and I would like to write backups of the scraped HTML to S3. I found that by using the following, I could write a particular scrape's HTML:
settings.py
ITEM_PIPELINE = {
    'scrapy.pipelines.files.S3FilesStore': 1
}
AWS_ACCESS_KEY_ID = os.environ['S3_MAIN_KEY']
AWS_SECRET_ACCESS_KEY = os.environ['S3_MAIN_SECRET']
FEED_FORMAT = "json"
FEED_URI = f's3://bucket_name/%(name)s/%(today)s.html'
And then in my scraper file:
def parse(self, response):
    yield {'body': str(response.body, 'utf-8')}
However, I would like to write to a key that includes the url as a subfolder, for example:
FEED_URI=f's3://bucket_name/%(name)s/%(url)s/%(today)s.html'
How can I dynamically grab the URL for the FEED_URI? I'm assuming that in
def start_requests(self):
    urls = [
        'http://www.example.com',
        'http://www.example_1.com',
        'http://www.example_2.com',
    ]
I have multiple URLs. Also, is there any way to write the raw HTML file, not nested in JSON? Thanks.
Feed exports are not meant to export to a different file per item.
For that, write an item pipeline instead.
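As a rough sketch of that approach: a pipeline could write each response body to its own file, keyed by the page URL. Everything here (the item shape with 'url' and 'body' fields, the local html_backups directory, the file naming) is an assumption to illustrate the idea; writing to S3 instead would need a client such as boto3.

# pipelines.py - a minimal sketch, assuming each yielded item carries 'url' and 'body' fields
import os
from urllib.parse import quote


class HtmlBackupPipeline:
    def open_spider(self, spider):
        # hypothetical local output directory; swap for an S3 upload if needed
        self.out_dir = "html_backups"
        os.makedirs(self.out_dir, exist_ok=True)

    def process_item(self, item, spider):
        # turn the URL into a filesystem-safe name, one file per scraped page
        filename = quote(item['url'], safe='') + ".html"
        with open(os.path.join(self.out_dir, filename), "w", encoding="utf-8") as f:
            f.write(item['body'])
        return item

The pipeline would then be enabled through ITEM_PIPELINES in settings.py, and the spider would yield items like {'url': response.url, 'body': str(response.body, 'utf-8')}.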

scrape an api result page with scrapy

I have this URL whose response content contains some JSON data.
https://www.tripadvisor.com/TypeAheadJson?action=API&types=geo%2Cnbrhd%2Chotel%2Ctheme_park&legacy_format=true&urlList=true&strictParent=true&query=sadaf%20dubai%20hotel&max=6&name_depth=3&interleaved=true&scoreThreshold=0.5&strictAnd=false&typeahead1_5=true&disableMaxGroupSize=true&geoBoostFix=true&neighborhood_geos=true&details=true&link_type=hotel%2Cvr%2Ceat%2Cattr&rescue=true&uiOrigin=trip_search_Hotels&source=trip_search_Hotels&startTime=1516800919604&searchSessionId=BA939B3D93510DABB510328CBF3353131516800881576ssid&nearPages=true
Every time I paste this URL into the browser with different queries, I get a nice JSON result. But in Scrapy or the Scrapy shell, I don't get any result. This is my Scrapy spider class:
link = "https://www.tripadvisor.com/TypeAheadJson?action=API&types=geo%2Cnbrhd%2Chotel%2Ctheme_park&legacy_format=true&urlList=true&strictParent=true&query={}%20dubai%20hotel&max=6&name_depth=3&interleaved=true&scoreThreshold=0.5&strictAnd=false&typeahead1_5=true&disableMaxGroupSize=true&geoBoostFix=true&neighborhood_geos=true&details=true&link_type=hotel%2Cvr%2Ceat%2Cattr&rescue=true&uiOrigin=trip_search_Hotels&source=trip_search_Hotels&startTime=1516800919604&searchSessionId=BA939B3D93510DABB510328CBF3353131516800881576ssid&nearPages=true"
def start_requests(self):
files = [f for f in listdir('results/') if isfile(join('results/', f))]
for file in files:
with open('results/' + file, 'r', encoding="utf8") as tour_info:
tour = json.load(tour_info)
for hotel in tour["hotels"]:
yield scrapy.Request(self.link.format(hotel))
name = 'tripadvisor'
allowed_domains = ['tripadvisor.com']
def parse(self, response):
print(response.body)
For this code, in the Scrapy shell, I get this result:
b'{"normalized":{"query":""},"query":{},"results":[],"partial_content":false}'
On the Scrapy command line, running the spider, I first got a "Forbidden by robots.txt" error for every URL. I set ROBOTSTXT_OBEY to False so it no longer obeys that file. Now I get [] for every request, but I should get a JSON object like this:
[
    {
        "urls": [
            {
                "url_type": "hotel",
                "name": "Sadaf Hotel, Dubai, United Arab Emirates",
                "type": "HOTEL",
                "url": "\/Hotel_Review-g295424-d633008-Reviews-Sadaf_Hotel-Dubai_Emirate_of_Dubai.html"
            }
        ],
        .
        .
        .
Try removing the sessionID from the URL and maybe check how "unfriendly" your settings.py is. (Also see this blog)
But it could be way easier to use wget, e.g.:

wget 'https://www.tripadvisor.com/TypeAheadJson?action=API&types=geo%2Cnbrhd%2Chotel%2Ctheme_park&legacy_format=true&urlList=true&strictParent=true&query={}%20dubai%20hotel&max=6&name_depth=3&interleaved=true&scoreThreshold=0.5&strictAnd=false&typeahead1_5=true&disableMaxGroupSize=true&geoBoostFix=true&neighborhood_geos=true&details=true&link_type=hotel%2Cvr%2Ceat%2Cattr&rescue=true&uiOrigin=trip_search_Hotels&source=trip_search_Hotels&startTime=1516800919604&nearPages=true' -O results.json
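If the spider is still served an empty result while the browser is not, a few "friendlier" settings are worth trying. This is only a sketch of common knobs (the user-agent string is just an example), not a guaranteed fix:

# settings.py - illustrative, less bot-like defaults
ROBOTSTXT_OBEY = False   # already changed in the question
DOWNLOAD_DELAY = 1.0     # slow the crawl down a little
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"
)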

Scrapy JSON output - values empty

I would like to crawl a set of web pages using Scrapy. However, when I try to write some values into the JSON file, those fields don't show up.
Here is my code:
import scrapy


class LLPubs(scrapy.Spider):
    name = "linlinks"
    start_urls = [
        'http://www.linnaeuslink.org/records/record/1',
        'http://www.linnaeuslink.org/records/record/2',
    ]

    def parse(self, response):
        for container in response.css('div.item'):
            yield {
                'text': container.css('div.field.soulsbyNo .value span::text').extract(),
                'uniformtitle': container.css('div.field.uniformTitle .value span::text').extract(),
                'title': container.css('div.field.title .value span::text').extract(),
                'opac': container.css('div.field.localControlNo .value span::text').extract(),
                'url': container.css('div#digitalLinks li a').extract(),
                'partner': container.css('div.logoContainer img:first-child').xpath('@src').extract(),
            }
And an example of my output:
{
    "text": ["Soulsby no. 46(1)"],
    "uniformtitle": ["Systema naturae"],
    "title": ["Caroli Linn\u00e6i ... Systema natur\u00e6\nin quo natur\u00e6 regna tria, secundum classes, ordines, genera, species, systematice proponuntur."],
    "opac": ["002178079"],
    "url": [],
    "partner": []
},
I am hoping I am doing something silly and easy to fix! Both of the paths I am using for "url" and "partner" were working from here:
scrapy shell 'http://www.linnaeuslink.org/records/record/1'
So, I just don't know what I am missing.
Oh, and I'm exporting to JSON using this command for now:
scrapy crawl linlinks -o quotes.json
Thanks for your help!
The problem seems to be that those selectors are not "findable" inside any div.item; you probably validated them without response.css('div.item'). To replicate what you used in the shell, just replace container.css with response.css for the url and partner keys.
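Applied to the parse method above, that suggestion would look roughly like this (only the last two keys change; a sketch of the answer's idea, not re-tested against the site):

def parse(self, response):
    for container in response.css('div.item'):
        yield {
            'text': container.css('div.field.soulsbyNo .value span::text').extract(),
            'uniformtitle': container.css('div.field.uniformTitle .value span::text').extract(),
            'title': container.css('div.field.title .value span::text').extract(),
            'opac': container.css('div.field.localControlNo .value span::text').extract(),
            # these two elements apparently live outside div.item, so select from the full response
            'url': response.css('div#digitalLinks li a').extract(),
            'partner': response.css('div.logoContainer img:first-child').xpath('@src').extract(),
        }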

Open a JSON file link through code

I'm creating an add-on and I'm modifying some functions that come within a .py file.
What I intend to do is the following. I have this code:

def channellist():
    return json.loads(openfile('lib.json', pastafinal=os.path.join(tugapath, 'resources')))

This code reads a lib.json file that is inside the resources subfolder of the tugapath folder. What I did was put the lib.json file in Dropbox, and I wanted to use the Dropbox link to lib.json instead of reading it from those local folders.
I tried to change the code, but without success.

def channellist():
    return json.loads(openfile('lib.json', pastafinal=os.path.join("https://www.dropbox.com/s/sj1246qtiodm6qd/lib.json?dl=1')))
If someone can help me, I'd be grateful! Thank you in advance.
Given that your link holds valid JSON - which is not the case with the content you posted - you could use requests.
If the content at Dropbox looked like this:

{"tv":
    {"epg": "tv",
     "streams":
         [{"url": "http://topchantv.net:3456/live/Stalker/Stalker/838.m3u8",
           "name": "IPTV",
           "resolve": False,
           "visible": True}],
     "name": "tv",
     "thumb": "thumb_tv.png"
    }
}
Then fetching the content would look like this:

import requests

url = 'https://www.dropbox.com/s/sj1246qtiodm6qd/lib.json?dl=1'
r = requests.get(url)
json_object = r.json()

So if you needed it inside a function, I guess you'd pass in the URL and return the JSON like so:

def channellist(url):
    r = requests.get(url)
    json_object = r.json()
    return json_object
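A short usage sketch, calling the channellist function defined above and assuming the Dropbox link really does return JSON with the structure shown earlier:

url = 'https://www.dropbox.com/s/sj1246qtiodm6qd/lib.json?dl=1'
channels = channellist(url)
print(channels["tv"]["name"])   # -> "tv", per the example structure above
print(channels["tv"]["thumb"])  # -> "thumb_tv.png"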
