I would like to crawl a set of web pages using Scrapy. However, when I write the scraped values into the JSON file, some of the fields come out empty.
Here is my code:
import scrapy

class LLPubs(scrapy.Spider):
    name = "linlinks"
    start_urls = [
        'http://www.linnaeuslink.org/records/record/1',
        'http://www.linnaeuslink.org/records/record/2',
    ]

    def parse(self, response):
        for container in response.css('div.item'):
            yield {
                'text': container.css('div.field.soulsbyNo .value span::text').extract(),
                'uniformtitle': container.css('div.field.uniformTitle .value span::text').extract(),
                'title': container.css('div.field.title .value span::text').extract(),
                'opac': container.css('div.field.localControlNo .value span::text').extract(),
                'url': container.css('div#digitalLinks li a').extract(),
                'partner': container.css('div.logoContainer img:first-child').xpath('#src').extract(),
            }
And an example of my output:
{
    "text": ["Soulsby no. 46(1)"],
    "uniformtitle": ["Systema naturae"],
    "title": ["Caroli Linn\u00e6i ... Systema natur\u00e6\nin quo natur\u00e6 regna tria, secundum classes, ordines, genera, species, systematice proponuntur."],
    "opac": ["002178079"],
    "url": [],
    "partner": []
},
I am hoping I am doing something silly and easy to fix! Both of the paths I am using for "url" and "partner" were working from here:
scrapy shell 'http://www.linnaeuslink.org/records/record/1'
So, I just don't know what I am missing.
Oh, and I am exporting to JSON using this command for now:
scrapy crawl linlinks -o quotes.json
Thanks for your help!
The problem seems to be that those selectors are not "findable" inside any div.item. You probably validated them in the shell without the response.css('div.item') scoping. To replicate what you used in the shell, just replace container.css with response.css for the "url" and "partner" keys.
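A minimal sketch of the corrected parse method (one extra observation: xpath('#src') is not valid XPath, so the attribute selector '@src' is presumably what actually worked in the shell):

def parse(self, response):
    for container in response.css('div.item'):
        yield {
            'text': container.css('div.field.soulsbyNo .value span::text').extract(),
            'uniformtitle': container.css('div.field.uniformTitle .value span::text').extract(),
            'title': container.css('div.field.title .value span::text').extract(),
            'opac': container.css('div.field.localControlNo .value span::text').extract(),
            # these two elements live outside div.item, so select from the full response
            'url': response.css('div#digitalLinks li a').extract(),
            'partner': response.css('div.logoContainer img:first-child').xpath('@src').extract(),
        }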
I am totally new to Python and Scrapy, and the Scrapy documentation is not exactly noob-friendly 😢😢. I made a spider for my school project which successfully scrapes the data I want, but the problem is with the formatting of the JSON export.
This is just a mock of how my code looks:
def parse_links(self, response):
    products = response.css('qwerty')
    for product in products:
        yield {
            'Title': response.xpath('/html/head/title/text()').get(),
            'URL': response.url,
            'Product': response.css('product').getall(),
            'Manufacturer': response.xpath('Manufacturer').getall(),
            'Description': response.xpath('Description').getall(),
            'Rating': response.css('rating').getall(),
        }
The export in JSON looks something like this:
[{"Title": "x", "URL": "https://y.com", "Product": ["a", "e"], "Manufacturer": ["b", "f"], "Description": ["c", "g"], "Rating": ["d", "h"]}]
To be precise this is how it looks now.
But I want the data to be exported in this format:
[{"Products": [{"Title":"x","URL":"https://y.com", "Links":[{"Product":"a","Manufacturer":"b","Description":"c","Rating":"d"},{"Product":"e","Manufacturer":"f","Description":"g","Rating":"h"}]}]}]
This is how I want the data.
I tried some things from the web but nothing worked, and I couldn't find any explanatory documents on the Scrapy site; as I said, the ones provided are not easy to understand for someone new like me. So any help would be great. I made the scraper pretty easily but have been stuck on this for a day.
FYI, I am not using any custom pipelines or items.
Thanks in advance and have a great day.
Try building the nested structure yourself and yielding one item per page:
def parse_links(self, response):
    links = []
    for product in response.css('qwerty'):
        links.append({
            "Product": product.css('product').get(),
            "Manufacturer": product.xpath('Manufacturer').get(),
            "Description": product.xpath('Description').get(),
            "Rating": product.css('rating').get(),
        })
    yield {
        "Products": [{
            "Title": response.xpath('/html/head/title/text()').get(),
            "URL": response.url,
            "Links": links,
        }]
    }
Here is my code, guys. To explain: first I scraped the listing links, then I followed each link and parsed some info, e.g. name, address, price, number. While running it in the terminal I get errors such as:
price = response.css('div.article_right_price::text').get().strip()
AttributeError: 'NoneType' object has no attribute 'strip'
but I can still export to CSV without a problem. One thing, though: this particular language is Georgian, and when I export to CSV I only see symbols which are not Georgian :)) I would be grateful if someone could help me.
import scrapy

class SsHomesSpider(scrapy.Spider):
    name = 'ss_home'
    start_urls = ['https://ss.ge/ka/udzravi-qoneba/l/bina/qiravdeba?CurrentUserId=&Query=&MunicipalityId=95&CityIdList=95&subdistr=&stId=&PrcSource=2&StatusField.FieldId=34&StatusField.Type=SingleSelect&StatusField.StandardField=Status&StatusField.SelectedValues=2&QuantityFrom=&QuantityTo=&PriceType=false&CurrencyId=2&PriceFrom=300&PriceTo=500&Context.Request.Query%5BQuery%5D=&IndividualEntityOnly=true&Fields%5B3%5D.FieldId=151&Fields%5B3%5D.Type=SingleSelect&Fields%5B3%5D.StandardField=None&Fields%5B4%5D.FieldId=150&Fields%5B4%5D.Type=SingleSelect&Fields%5B4%5D.StandardField=None&Fields%5B5%5D.FieldId=152&Fields%5B5%5D.Type=SingleSelect&Fields%5B5%5D.StandardField=None&Fields%5B6%5D.FieldId=29&Fields%5B6%5D.Type=SingleSelect&Fields%5B6%5D.StandardField=None&Fields%5B7%5D.FieldId=153&Fields%5B7%5D.Type=MultiSelect&Fields%5B7%5D.StandardField=None&Fields%5B8%5D.FieldId=30&Fields%5B8%5D.Type=SingleSelect&Fields%5B8%5D.StandardField=None&Fields%5B0%5D.FieldId=48&Fields%5B0%5D.Type=Number&Fields%5B0%5D.StandardField=None&Fields%5B0%5D.ValueFrom=&Fields%5B0%5D.ValueTo=&Fields%5B1%5D.FieldId=146&Fields%5B1%5D.Type=Number&Fields%5B1%5D.StandardField=None&Fields%5B1%5D.ValueFrom=&Fields%5B1%5D.ValueTo=&Fields%5B2%5D.FieldId=28&Fields%5B2%5D.Type=Number&Fields%5B2%5D.StandardField=Floor&Fields%5B2%5D.ValueFrom=&Fields%5B2%5D.ValueTo=&Fields%5B9%5D.FieldId=15&Fields%5B9%5D.Type=Group&Fields%5B9%5D.StandardField=None&Fields%5B9%5D.Values%5B0%5D.Value=35&Fields%5B9%5D.Values%5B1%5D.Value=36&Fields%5B9%5D.Values%5B2%5D.Value=37&Fields%5B9%5D.Values%5B3%5D.Value=38&Fields%5B9%5D.Values%5B4%5D.Value=39&Fields%5B9%5D.Values%5B5%5D.Value=40&Fields%5B9%5D.Values%5B6%5D.Value=41&Fields%5B9%5D.Values%5B7%5D.Value=42&Fields%5B9%5D.Values%5B8%5D.Value=24&Fields%5B9%5D.Values%5B9%5D.Value=27&Fields%5B9%5D.Values%5B10%5D.Value=22&Fields%5B9%5D.Values%5B11%5D.Value=20&Fields%5B9%5D.Values%5B12%5D.Value=8&Fields%5B9%5D.Values%5B13%5D.Value=6&Fields%5B9%5D.Values%5B14%5D.Value=4&Fields%5B9%5D.Values%5B15%5D.Value=5&Fields%5B9%5D.Values%5B16%5D.Value=9&Fields%5B9%5D.Values%5B17%5D.Value=3&Fields%5B9%5D.Values%5B18%5D.Value=120&AgencyId=&VipStatus=&Fields%5B9%5D.Values%5B0%5D.Selected=false&Fields%5B9%5D.Values%5B1%5D.Selected=false&Fields%5B9%5D.Values%5B2%5D.Selected=false&Fields%5B9%5D.Values%5B3%5D.Selected=false&Fields%5B9%5D.Values%5B4%5D.Selected=false&Fields%5B9%5D.Values%5B5%5D.Selected=false&Fields%5B9%5D.Values%5B6%5D.Selected=false&Fields%5B9%5D.Values%5B7%5D.Selected=false&Fields%5B9%5D.Values%5B8%5D.Selected=false&Fields%5B9%5D.Values%5B9%5D.Selected=false&Fields%5B9%5D.Values%5B10%5D.Selected=false&Fields%5B9%5D.Values%5B11%5D.Selected=false&Fields%5B9%5D.Values%5B12%5D.Selected=false&Fields%5B9%5D.Values%5B13%5D.Selected=false&Fields%5B9%5D.Values%5B14%5D.Selected=false&Fields%5B9%5D.Values%5B15%5D.Selected=false&Fields%5B9%5D.Values%5B16%5D.Selected=false&Fields%5B9%5D.Values%5B17%5D.Selected=false&Fields%5B9%5D.Values%5B18%5D.Selected=false']

    def parse(self, response):
        all_listing = response.css('div.latest_desc a::attr(href)')
        for link in all_listing:
            yield response.follow(link.get(), callback=self.parse_listings)

    def parse_listings(self, response):
        name = response.css('div.article_in_title h1::text').get()
        price = response.css('div.article_right_price::text').get().strip()
        square_m = response.css('div.WholeFartBlock text::text').get().strip()
        street = response.css('div.StreeTaddressList a::text').get().strip()
        number = response.css('div.UserMObileNumbersBlock a::attr(href)').get().strip("tel':")
        yield {
            'name': name,
            'price': price,
            'square_m': square_m,
            'street': street,
            'number': number,
        }
The error you are getting is not Scrapy-related: you are calling the method strip() on a None object. Your selectors are returning None instead of the string value you are expecting. Check your selectors again, and also consider using Scrapy item loaders to clean your items.
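For illustration, a minimal sketch of a defensive parse_listings, plus a fix for the encoding problem: FEED_EXPORT_ENCODING is a standard Scrapy setting, and 'utf-8-sig' writes a byte-order mark so that Excel detects the encoding, which is the usual reason Georgian text shows up as garbage in CSV. The clean helper is a hypothetical name of mine, not part of Scrapy:

# settings.py -- make the CSV export readable in Excel
FEED_EXPORT_ENCODING = 'utf-8-sig'

# spider -- guard against missing values instead of calling .strip() on None
def parse_listings(self, response):
    def clean(css, chars=None):
        # return the stripped value, or None when the selector matched nothing
        value = response.css(css).get()
        return value.strip(chars) if value else None

    yield {
        'name': clean('div.article_in_title h1::text'),
        'price': clean('div.article_right_price::text'),
        'square_m': clean('div.WholeFartBlock text::text'),
        'street': clean('div.StreeTaddressList a::text'),
        'number': clean('div.UserMObileNumbersBlock a::attr(href)', "tel':"),
    }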
I'm new to the scrapy package, and here's my problem:
import scrapy

class simpleSpider(scrapy.Spider):
    name = "simple_spider"
    start_urls = ['http://quotes.toscrape.com/login']

    def parse(self, response):
        token = response.css("input[name=csrf_token]::attr(value)").extract_first()
        formdata = {
            'csrf_token': token,
            'username': 'rseiji',
            'password': 'seiji1234'
        }
        yield scrapy.FormRequest(response.url, formdata=formdata, callback=self.parse_logged)

    def parse_logged(self, response):
        yield {
            'text': response.css('span.text::text').extract(),
            'author': response.css('small.author::text').extract(),
            'tags': response.css('div.tags a.tag::text').extract()
        }
This is my spider, and it does work. But when I run:
scrapy crawl simple_spider -o mySpider.csv
the .csv file doesn't seem to be correctly formatted. It extracts only the "text" column.
What's wrong?
Thank you!
Edited: This is my .csv file:
text,author,tags
"“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”,“It is our choices, Harry, that show what we truly are, far more than our abilities.”,“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”,“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”,“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”,“Try not to become a man of success. Rather become a man of value.”,“It is better to be hated for what you are than to be loved for what you are not.”,“I have not failed. I've just found 10,000 ways that won't work.”,“A woman is like a tea bag; you never know how strong it is until it's in hot water.”,“A day without sunshine is like, you know, night.”","Albert Einstein,J.K. Rowling,Albert Einstein,Jane Austen,Marilyn Monroe,Albert Einstein,André Gide,Thomas A. Edison,Eleanor Roosevelt,Steve Martin","change,deep-thoughts,thinking,world,abilities,choices,inspirational,life,live,miracle,miracles,aliteracy,books,classic,humor,be-yourself,inspirational,adulthood,success,value,life,love,edison,failure,inspirational,paraphrased,misattributed-eleanor-roosevelt,humor,obvious,simile"
...
I figured out that it is not that there are empty columns: the .csv file is simply not well formed. Everything came out in just one row!
Solved!
import scrapy

class simpleSpider(scrapy.Spider):
    name = "simple_spider"
    start_urls = ['http://quotes.toscrape.com/login']

    def parse(self, response):
        formdata = {
            'username': 'rseiji',
            'password': 'seiji1234'
        }
        yield scrapy.FormRequest.from_response(response, formdata=formdata, callback=self.parse_logged)

    def parse_logged(self, response):
        # Get list of Selector objects and loop through them
        for quote in response.css('div.quote'):
            # yield each item individually
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
                'author_goodreads_url': quote.css('span a[href*="goodreads.com"]::attr(href)').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract()
            }
The problem was that I was calling extract() on the whole page. What I wanted was a list of Selector objects. extract() always produces a list: a list of strings of the HTML matched by your selector (or a single string with extract_first()). By calling neither extract() nor extract_first() you get a list of selectors, which you can iterate through and chain new selectors on, allowing you to pick out each individual item.
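For illustration, a minimal sketch of the difference in a shell session against the quotes page (same selectors as the spider above):

# flat: every quote text on the page ends up in one list
response.css('span.text::text').extract()

# chained: each sub-selector is scoped to one quote at a time
for quote in response.css('div.quote'):
    quote.css('span.text::text').extract_first()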
System: Windows 10, Python 2.7.15, Scrapy 1.5.1
Goal: Retrieve text from within html markup for each of the link items on the target website, including those revealed (6 at a time) via the '+ SEE MORE ARCHIVES' button.
Target Website: https://magic.wizards.com/en/content/deck-lists-magic-online-products-game-info
Initial Progress: Python and Scrapy successfully installed. The following code...
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    custom_settings = {
        # specifies exported fields and order
        'FEED_EXPORT_FIELDS': ["href", "eventtype", "eventmonth", "eventdate", "eventyear"],
    }

    def start_requests(self):
        urls = [
            'https://magic.wizards.com/en/content/deck-lists-magic-online-products-game-info',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        for event in response.css('div.article-item-extended'):
            yield {
                'href': event.css('a::attr(href)').extract(),
                'eventtype': event.css('h3::text').extract(),
                'eventmonth': event.css('span.month::text').extract(),
                'eventdate': event.css('span.day::text').extract(),
                'eventyear': event.css('span.year::text').extract(),
            }
...successfully produces the following results (when -o to .csv)...
href,eventtype,eventmonth,eventdate,eventyear
/en/articles/archive/mtgo-standings/competitive-standard-constructed-league-2018-08-02,Competitive Standard Constructed League, August ,2, 2018
/en/articles/archive/mtgo-standings/pauper-constructed-league-2018-08-01,Pauper Constructed League, August ,1, 2018
/en/articles/archive/mtgo-standings/competitive-modern-constructed-league-2018-07-31,Competitive Modern Constructed League, July ,31, 2018
/en/articles/archive/mtgo-standings/pauper-challenge-2018-07-30,Pauper Challenge, July ,30, 2018
/en/articles/archive/mtgo-standings/legacy-challenge-2018-07-30,Legacy Challenge, July ,30, 2018
/en/articles/archive/mtgo-standings/competitive-standard-constructed-league-2018-07-30,Competitive Standard Constructed League, July ,30, 2018
However, the spider will not touch any of the info buried behind the Ajax button. I've done a fair amount of Googling and digesting of documentation, example articles, and 'help me' posts. I am under the impression that to get the spider to actually see the Ajax-buried info, I need to simulate some sort of request: variously, the correct type of request might be something to do with XHR, a Scrapy FormRequest, or other. I am simply too new to web architecture in general to be able to surmise the answer.
I hacked together a version of the initial code that sends a FormRequest, which still seems to reach the initial page just fine, yet incrementing the only parameter that appears to change (when inspecting the XHR calls sent out when physically clicking the button on the page) does not appear to have any effect. That code is here...
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    custom_settings = {
        # specifies exported fields and order
        'FEED_EXPORT_FIELDS': ["href", "eventtype", "eventmonth", "eventdate", "eventyear"],
    }

    def start_requests(self):
        for i in range(1, 10):
            yield scrapy.FormRequest(
                url='https://magic.wizards.com/en/content/deck-lists-magic-online-products-game-info',
                formdata={
                    'l': 'en', 'f': '9041', 'search-result-theme': '', 'limit': '6',
                    'fromDate': '', 'toDate': '', 'event_format': '0', 'sort': 'DESC',
                    'word': '', 'offset': str(i * 6)
                },
                callback=self.parse)

    def parse(self, response):
        for event in response.css('div.article-item-extended'):
            yield {
                'href': event.css('a::attr(href)').extract(),
                'eventtype': event.css('h3::text').extract(),
                'eventmonth': event.css('span.month::text').extract(),
                'eventdate': event.css('span.day::text').extract(),
                'eventyear': event.css('span.year::text').extract(),
            }
...and the results are the same as before, except the 6 output lines are repeated, as a block, 9 extra times.
Can anyone help point me to what I am missing? Thank you in advance.
Postscript: I always seem to get heckled out of my chair whenever I seek help for coding problems. If I am doing something wrong, please have mercy on me, I will do whatever I can to correct it.
Scrapy doesn't render dynamic content on its own; you need something else to deal with the JavaScript. Try one of these:
scrapy + selenium
scrapy + splash
This blog post about scrapy + splash has a good introduction to the topic.
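For reference, a minimal sketch of the scrapy + splash route, assuming a Splash instance is running locally and the scrapy-splash package is configured in settings.py as described in its README:

import scrapy
from scrapy_splash import SplashRequest  # pip install scrapy-splash

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        url = 'https://magic.wizards.com/en/content/deck-lists-magic-online-products-game-info'
        # 'wait' gives the page's JavaScript time to run before the HTML snapshot is taken
        yield SplashRequest(url, callback=self.parse, args={'wait': 2})

    def parse(self, response):
        # the response now contains the rendered HTML, so the usual selectors apply
        for event in response.css('div.article-item-extended'):
            yield {'href': event.css('a::attr(href)').extract()}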
I am working with Scrapy to scrape a website, and I want to extract only those items which were not scraped in its previous run.
I am trying it on the "https://www.ndtv.com/top-stories" website, to extract only the 1st headline, and only if it has been updated.
Below is my code:
import scrapy
from selenium import webdriver
from w3lib.url import url_query_parameter

class QuotesSpider(scrapy.Spider):
    name = "test"
    start_urls = [
        'https://www.ndtv.com/top-stories',
    ]

    def parse(self, response):
        print('testing')
        print(response.url)
        yield {
            'heading': response.css('div.nstory_header a::text').extract_first(),
        }
DOWNLOADER_MIDDLEWARES = {
    'scrapy_crawl_once.CrawlOnceMiddleware': 100,
}
SPIDER_MIDDLEWARES = {
    #'inc_crawling.middlewares.IncCrawlingSpiderMiddleware': 543,
    'scrapy.contrib.spidermiddleware.referer.RefererMiddleware': True,
    'scrapy_deltafetch.DeltaFetch': 100,
    'scrapy_crawl_once.CrawlOnceMiddleware': 100,
    'scrapylib.deltafetch.DeltaFetch': 100,
    'inc_crawling.middlewares.deltafetch.DeltaFetch': 100,
}
COOKIES_ENABLED = True
COOKIES_DEBUG = True
DELTAFETCH_ENABLED = True
DELTAFETCH_DIR = '/home/administrator/apps/inc_crawling'
DOTSCRAPY_ENABLED = True
The settings above are in my settings.py file.
I am running the above code with the "scrapy crawl test -o test.json" command, and after each run the .db file and test.json file get updated.
My expectation is that the .db file should be updated only when the 1st headline has changed.
Kindly help me if there is a better approach to extract only the updated headline.
A good way to implement this would be to override the DUPEFILTER_CLASS to check your database before doing the actual requests.
Scrapy uses a dupefilter class to avoid requesting the same page twice, but by default it only works within a single run of a spider.
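A minimal sketch of that idea, persisting fingerprints in a plain text file in place of a real database (the file name, module path, and storage scheme here are illustrative assumptions, not part of Scrapy):

import os
from scrapy.dupefilters import RFPDupeFilter
from scrapy.utils.request import request_fingerprint

class PersistentDupeFilter(RFPDupeFilter):
    """Skips requests whose fingerprints were recorded by an earlier run."""

    seen_file = 'seen_fingerprints.txt'  # illustrative; swap in your database

    def __init__(self, path=None, debug=False):
        super(PersistentDupeFilter, self).__init__(path, debug)
        # preload fingerprints remembered from previous runs
        if os.path.exists(self.seen_file):
            with open(self.seen_file) as f:
                self.fingerprints.update(line.strip() for line in f)

    def request_seen(self, request):
        fp = request_fingerprint(request)
        if fp in self.fingerprints:
            return True
        self.fingerprints.add(fp)
        with open(self.seen_file, 'a') as f:
            f.write(fp + '\n')
        return False

Then point Scrapy at it in settings.py (the module path is hypothetical):

DUPEFILTER_CLASS = 'inc_crawling.dupefilters.PersistentDupeFilter'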