Export csv file from scrapy (not via command line) - python

I successfully exported my items into a csv file from the command line like:
scrapy crawl spiderName -o filename.csv
My question is:
What is the easiest solution to do the same in the code? I need this because I extract the filename from another file.
The end scenario should be that I call
scrapy crawl spiderName
and it writes the items into filename.csv

Why not use an item pipeline?
WriteToCsv.py
import csv
from YOUR_PROJECT_NAME_HERE import settings

def write_to_csv(item):
    writer = csv.writer(open(settings.csv_file_path, 'a'), lineterminator='\n')
    writer.writerow([item[key] for key in item.keys()])

class WriteToCsv(object):
    def process_item(self, item, spider):
        write_to_csv(item)
        return item
settings.py
ITEM_PIPELINES = { 'project.pipelines_path.WriteToCsv.WriteToCsv' : A_NUMBER_HIGHER_THAN_ALL_OTHER_PIPELINES}
csv_file_path = PATH_TO_CSV
If you want items to be written to a separate csv for each spider, you could give your spider a CSV_PATH field and then, in your pipeline, use the spider's field instead of the path from settings.
This works, I tested it in my project.
HTH
http://doc.scrapy.org/en/latest/topics/item-pipeline.html
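For the per-spider variant mentioned above, a minimal sketch could look like this (the CSV_PATH attribute, the spider name and the fallback logic are assumptions, not something Scrapy provides out of the box):
import csv
import scrapy
from YOUR_PROJECT_NAME_HERE import settings  # same placeholder as in the answer above

class MySpider(scrapy.Spider):
    name = 'my_spider'                      # hypothetical spider
    CSV_PATH = 'my_spider_output.csv'       # hypothetical per-spider output path

class WriteToCsv(object):
    def process_item(self, item, spider):
        # use the spider's CSV_PATH if it defines one, otherwise fall back to the project-wide setting
        path = getattr(spider, 'CSV_PATH', settings.csv_file_path)
        with open(path, 'a', newline='') as f:
            csv.writer(f, lineterminator='\n').writerow([item[key] for key in item.keys()])
        return item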

There is an updated way to save your file in scrapy, which is using "FEEDS":
class mySpider(scrapy.Spider):
    name = "myProject"
    custom_settings = {
        "FEEDS": {"fileName.csv": {"format": "csv"}},
    }
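With this in place, running scrapy crawl myProject should write the items to fileName.csv; as far as I know the FEEDS setting requires a reasonably recent Scrapy version (2.1 or later).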

That's what Feed Exports are for:
http://doc.scrapy.org/en/latest/topics/feed-exports.html
One of the most frequently required features when implementing scrapers is being able to store the scraped data properly and, quite often, that means generating an “export file” with the scraped data (commonly called “export feed”) to be consumed by other systems.
Scrapy provides this functionality out of the box with the Feed Exports, which allows you to generate a feed with the scraped items, using multiple serialization formats and storage backends.
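As a minimal sketch of that idea (the file names here are placeholders), a feed export can also be configured project-wide in settings.py on recent Scrapy versions:
# settings.py -- a minimal sketch; the file names are placeholders
FEEDS = {
    'items.csv': {'format': 'csv'},
    'items.json': {'format': 'json'},
}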

The up-to-date answer is:
Use the built-in exporter. You can set the filename as the key. The config may look like:
filename = 'export'
class mySpider(scrapy.Spider):
    custom_settings = {
        'FEEDS': {
            f'{filename}.csv': {
                'format': 'csv',
                'overwrite': True
            }
        }
    }
Documentation: https://docs.scrapy.org/en/latest/topics/feed-exports.html#std-setting-FEEDS
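Since the question mentions extracting the filename from another file, the filename variable above could be read at module level before the spider class is defined, for example (output_name.txt is just an assumed location):
# hypothetical: the csv name is stored in another file and read before the spider class is defined
with open('output_name.txt') as f:
    filename = f.read().strip()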

Related

Scraping multiple urls and storing the corresponding data in separate files in scrapy

I'm a university student working on a project and I'm pretty much new to using Scrapy. I've done as much research as I can on Stack Overflow/YouTube, but I cannot seem to integrate the ideas I have seen into what I'm trying to do. Basically, I have a list of urls that I need to scrape data from. I want to save/export the data scraped from each url to a corresponding json/csv file. The ultimate goal would then be to transfer those files to a database. I have managed to code the spider to get the data; however, I have to manually change the url in the spider class and export one file at a time. I cannot seem to figure out a way to automate it. This is my first time posting to Stack Overflow, so if you can help me this would be very much appreciated.
I have looked at pipelines/using the open function with a write, but I don't think I understood how to use them to export multiple files based on different urls.
I may not have formulated my question in the right way. I need to go through the urls list, scrape reviews and the corresponding rating from the website and store those, ideally in a json file/database where I will access them later, clean the data and then feed them to a sentiment analysis model.
So, for example:
I have a list of urls that I need to go through (I have them stored in a csv file), and check if they are "good", as some of the links don't work/have no reviews.
I scrape reviews and the corresponding rating from the website and store those, in a json/csv file for now, as my plan is to add them to a database (sql) later.
[I'm working on figuring out if I can do it now itself, though]
I will need to run a sentiment analysis model on those to predict the sentiment associated with a review and test the prediction against a given rating.
Note: because there is quite a lot of data to scrape, I was planning to clean it afterwards, thoughts?
I have now included a copy of the code. I started by manually changing the url myself and using the command line to export the file like:
scrapy crawl spidey -O name_of_file.json
However, this is not a very efficient way to go about it.
Here is a snap of my code: https://imgur.com/GlgmB0q.
I have added the item loader and multiple urls; before, it was simply yielding the items with a single url that I manually changed.
Feel free to msg me if you can help, I would very much appreciate it. Twitter, Discord etc...
You can use an item pipeline to split the output into multiple files. See the sample code below.
import scrapy
from scrapy.exporters import CsvItemExporter
from urllib.parse import urlparse

class PerFilenameExportPipeline:
    def open_spider(self, spider):
        self.filename_to_exporter = {}

    def close_spider(self, spider):
        for exporter in self.filename_to_exporter.values():
            exporter.finish_exporting()

    def _exporter_for_item(self, item):
        filename = item.get('url').strip()
        if filename not in self.filename_to_exporter:
            f = open(f'{filename}.csv', 'wb')
            exporter = CsvItemExporter(f, export_empty_fields=True)
            exporter.start_exporting()
            self.filename_to_exporter[filename] = exporter
        return self.filename_to_exporter[filename]

    def process_item(self, item, spider):
        exporter = self._exporter_for_item(item)
        exporter.export_item(item)
        return item

class SampleSpider(scrapy.Spider):
    name = 'sample'
    start_urls = ['https://example.com', "https://www.scrapethissite.com/pages/simple", "https://stackoverflow.com"]
    custom_settings = {
        'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36',
        "ITEM_PIPELINES": {PerFilenameExportPipeline: 100}
    }

    def parse(self, response):
        parsed = urlparse(response.url)
        yield {
            "url": parsed.netloc,
            "title": response.css("title::text").get()
        }

Can't get rid of a problematic page even when using rotation of proxies within scrapy

I've created a script using scrapy, implementing rotation of proxies within it, to parse the address from a few hundred similar links like this. I've supplied those links from a csv file within the script.
The script is doing fine until it encounters a response url like this https://www.bcassessment.ca//Property/UsageValidation. Once the script starts getting that link, it can't bypass it. FYI, I'm using meta properties containing lead_link to make use of the original link instead of the redirected link as a retry, so I should be able to bypass that barrier.
It doesn't happen when I use proxies within the requests library. To be clearer - while using the requests library, the script does encounter this page /Property/UsageValidation but bypasses it successfully after a few retries.
The spider is like:
import csv
import scrapy
from scrapy.crawler import CrawlerProcess

class mySpider(scrapy.Spider):
    name = "myspider"
    custom_settings = {
        'DOWNLOADER_MIDDLEWARES': {
            'stackoverflow_spider.middlewares.ProxiesMiddleware': 100,
        }
    }

    def start_requests(self):
        with open("output_main.csv", "r") as f:
            reader = csv.DictReader(f)
            for item in list(reader):
                lead_link = item['link']
                yield scrapy.Request(lead_link, self.parse, meta={"lead_link": lead_link, "download_timeout": 20}, dont_filter=True)

    def parse(self, response):
        address = response.css("h1#mainaddresstitle::text").get()
        print(response.meta['proxy'], address)

if __name__ == "__main__":
    c = CrawlerProcess({
        'USER_AGENT': 'Mozilla/5.0',
        'LOG_LEVEL': 'ERROR',
    })
    c.crawl(mySpider)
    c.start()
How can I prevent the script from encountering that page?
PS: I've attached a few of the links in a text file in case anyone wants to give it a try.
To make a session-safe proxy implementation for a scrapy app, you need to add an additional cookiejar meta key at the place where you assign the proxy to request.meta, like this:
....
yield scrapy.Request(url=link, meta={"proxy": address, "cookiejar": address})
In this case the scrapy CookiesMiddleware will create an additional cookie session for each proxy.
Related specifics of the scrapy proxy implementation are mentioned in this answer
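A minimal sketch of a downloader middleware doing this (the middleware name and the proxy list are assumptions; the question's own ProxiesMiddleware may look different):
import random

class ProxiesMiddleware:
    # hypothetical proxy pool; in practice this would come from your own list or service
    PROXIES = ['http://111.111.111.111:8080', 'http://222.222.222.222:8080']

    def process_request(self, request, spider):
        address = random.choice(self.PROXIES)
        request.meta['proxy'] = address
        # keying the cookiejar by the proxy address gives each proxy its own cookie session
        request.meta['cookiejar'] = address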

Scrapy - How to write to a custom FEED_URI

I'm new to Scrapy and I would like to write backups of the HTML to S3. I found that by using the following, I could write a particular scrape's HTML:
settings.py
ITEM_PIPELINE = {
    'scrapy.pipelines.files.S3FilesStore': 1
}
AWS_ACCESS_KEY_ID = os.environ['S3_MAIN_KEY']
AWS_SECRET_ACCESS_KEY= os.environ['S3_MAIN_SECRET']
FEED_FORMAT = "json"
FEED_URI=f's3://bucket_name/%(name)s/%(today)s.html'
And then in my scraper file:
def parse(self, response):
    yield {'body': str(response.body, 'utf-8')}
However, I would like to write to a key that includes the url as a subfolder, for example:
FEED_URI=f's3://bucket_name/%(name)s/%(url)s/%(today)s.html'
How can I dynamically grab the url for the FEED_URI? I'm assuming that in
def start_requests(self):
    urls = [
        'http://www.example.com',
        'http://www.example_1.com',
        'http://www.example_2.com',
    ]
I have multiple urls. Also, is there any way to write the raw HTML file, not nested in JSON? Thanks.
Feed exports are not meant to export to a different file per item.
For that, write an item pipeline instead.
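A minimal sketch of such a pipeline (writing local files rather than S3, and the path scheme is an assumption) could look like this; for S3 you would upload the bytes instead of writing to disk:
from pathlib import Path
from urllib.parse import urlparse

class HtmlBackupPipeline:
    def process_item(self, item, spider):
        parsed = urlparse(item['url'])   # assumes the spider also yields the source url in the item
        out_dir = Path('backups') / spider.name / parsed.netloc
        out_dir.mkdir(parents=True, exist_ok=True)
        (out_dir / 'page.html').write_text(item['body'], encoding='utf-8')
        return item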

how can I access a variable parameter at spider class from pipelines.py

I have 3 spider files and classes, and I want to save the item information to a csv file whose filename differs depending on a variable parameter of the search condition. For that, I need to access the spider class parameter.
So, I have three questions:
How can I access the spider class's parameter?
What is the best way to create each of the csv files? The trigger condition is a new request being issued in the parse function for a new search result.
logger = logging.getLogger(__name__) is not working in pipelines.py.
How can I print that information?
Below is my log code style:
logger.log(logging.INFO,'\n======= %s ========\n', filename)
I have searched for ways on Google many times, but I couldn't find a solution.
I did try to use the from_crawler function, but I couldn't find a case to adapt.
Scrapy 1.6.0
python 3.7.3
os window 7 / 32bit
Code:
class CensusGetitemSpider(scrapy.Spider):
    name = 'census_getitem'
    startmonth = 1
    filename = None

    def parse(self, response):
        for x in data:
            self.filename = str(self.startmonth + 1)
            .
            .
            .
            yield item
            yield scrapy.Request(link, callback=self.parse)
You can access the spider class and instance attributes from pipelines.py using the spider parameter that is passed to most of the pipeline methods.
For example:
def open_spider(self, spider):
    self.filename = spider.name
You can see more about item pipelines here: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
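A minimal sketch of a pipeline that builds the csv name from a spider attribute (the filename attribute mirrors the question's spider, but the exact wiring and class name are assumptions):
from scrapy.exporters import CsvItemExporter

class PerSearchCsvPipeline:
    def open_spider(self, spider):
        self.exporters = {}

    def close_spider(self, spider):
        for exporter, f in self.exporters.values():
            exporter.finish_exporting()
            f.close()

    def process_item(self, item, spider):
        # spider.filename is the attribute your spider updates for each search condition
        name = spider.filename or 'default'
        if name not in self.exporters:
            f = open(f'{name}.csv', 'wb')          # CsvItemExporter expects a binary file object
            exporter = CsvItemExporter(f)
            exporter.start_exporting()
            self.exporters[name] = (exporter, f)
        self.exporters[name][0].export_item(item)
        return item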
You can save it directly from the command line, just define a filename:
scrapy crawl yourspider -o output.csv
But if you really need it to be set from the spider, you can use a custom setting per spider, for example:
class YourSpider(scrapy.Spider):
    name = 'yourspider'
    start_urls = ['www.yoursite.com']
    custom_settings = {
        'FEED_URI': 'output.csv',
        'FEED_FORMAT': 'csv',
    }
Use spider.logger.info('Your message')
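For example, a minimal sketch of logging from a pipeline via the spider's logger (class name and message are just placeholders):
class CensusLoggingPipeline:
    def process_item(self, item, spider):
        # the spider's logger is already configured, so this shows up in the normal scrapy log output
        spider.logger.info('======= %s ========', spider.filename)
        return item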

scrapy set the output file in code

I am using scrapy with python
I can set the output json file in the cmd, but now I need to do that in code.
I tried this:
in the settings:
FEED_EXPORTERS = {
    'jsonlines': 'scrapy.contrib.exporter.JsonLinesItemExporter',
}
FEED_FORMAT = 'jsonlines'
in the spider:
def __init__(self):
    settings.overrides['FEED_URI'] = 'output.json'
Note
I am developing a simple spider, so I just need an Item Exporter; I don't need to create any item pipeline.
Thanks for helping
The answer is found in an example in the Scrapy documentation. You can output to any format by writing the correct item pipeline, as follows:
import json

class JsonWriterPipeline(object):
    def __init__(self):
        # open in text mode, since json.dumps returns a str
        self.file = open('items.jl', 'w')

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item
Note that you must also include this pipeline in the default Scrapy project settings file.
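For example, a minimal sketch of that settings entry (the module path myproject.pipelines is an assumption):
# settings.py
ITEM_PIPELINES = {
    'myproject.pipelines.JsonWriterPipeline': 300,
}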
