I have an example Scrapy project; it is pretty much the default. Its folder structure:
craiglist_sample/
├── craiglist_sample
│   ├── __init__.py
│   ├── items.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       ├── __init__.py
│       └── test.py
└── scrapy.cfg
When I run scrapy crawl craigs -o items.csv -t csv in the Windows command prompt, it writes the Craigslist items and links to the console.
I want to create an example.py in the main folder and print these items to the Python console from inside it.
I tried
from scrapy import cmdline
cmdline.execute("scrapy crawl craigs".split())
but it produces the same output as the Windows shell. How can I make it print only the items and links?
test.py:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from craiglist_sample.items import CraiglistSampleItem

class MySpider(CrawlSpider):
    name = "craigs"
    ## allowed_domains = ["sfbay.craigslist.org"]
    ## start_urls = ["http://sfbay.craigslist.org/npo/"]
    allowed_domains = ["craigslist.org"]
    start_urls = ["http://sfbay.tr.craigslist.org/search/npo?"]
    ## search\/npo\?s=
    rules = (
        Rule(SgmlLinkExtractor(allow=('s=\d00',), restrict_xpaths=('//a[@class="button next"]',)),
             callback="parse_items", follow=True),
    )

    def parse_items(self, response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.select('//span[@class="pl"]')
        ## titles = hxs.select("//p[@class='row']")
        items = []
        for titles in titles:
            item = CraiglistSampleItem()
            item["title"] = titles.select("a/text()").extract()
            item["link"] = titles.select("a/@href").extract()
            items.append(item)
        return items
One approach is to turn off Scrapy's default shell output and insert a print statement inside your parse_items function.
1 - Turn off logging in settings.py:
LOG_ENABLED = False
Documentation about logging levels in Scrapy is here: http://doc.scrapy.org/en/latest/topics/logging.html
2 - Add a print statement for the items you are interested in:
for titles in titles:
    item = CraiglistSampleItem()
    item["title"] = titles.select("a/text()").extract()
    item["link"] = titles.select("a/@href").extract()
    items.append(item)
    print item["title"], item["link"]
The shell output will be:
[u'EXECUTIVE ASSISTANT'] [u'/eby/npo/4848086929.html']
[u'Direct Support Professional'] [u'/eby/npo/4848043371.html']
[u'Vocational Counselor'] [u'/eby/npo/4848042572.html']
[u'Day Program Supervisor'] [u'/eby/npo/4848041846.html']
[u'Educational Specialist'] [u'/eby/npo/4848040348.html']
[u'ORGANIZE WITH GREENPEACE - Grassroots Nonprofit Job!']
[u'/eby/npo/4847984654.html']
EDIT: Code for executing from a script:
import os
os.system('scrapy crawl craigs > log.txt')
There are several other ways to execute a command-line program from within Python. Check "Executing command line programs from within python" and "Calling an external command in Python".
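Alternatively, if you want your example.py to end up with the items as Python objects rather than parsing the console output, you can run the crawl in-process and listen for the item_scraped signal. This is only a rough sketch, assuming a Scrapy version recent enough to provide CrawlerProcess; it reuses the 'craigs' spider name from the question:

# example.py -- run the 'craigs' spider in-process and print only the items
from scrapy import signals
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

def item_scraped(item, response, spider):
    # called once for every item the spider returns
    print(item['title'], item['link'])

process = CrawlerProcess(get_project_settings())
crawler = process.create_crawler('craigs')          # look the spider up by its name
crawler.signals.connect(item_scraped, signal=signals.item_scraped)
process.crawl(crawler)
process.start()                                     # blocks until the crawl finishes

Combined with LOG_ENABLED = False from step 1, the only console output left should be the items themselves.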
Related
I cannot load Scrapy items into Scrapy spiders. Here is my project structure:
-rosen
.log
.scrapers
..scrapers
...spiders
....__init__.py
....exampleSpider.py
...__init__.py
...items.py
...middlewares.py
...pipelines.py
...settings.py
.src
..__init__.py
..otherStuff.py
.tmp
This structure was created using scrapy startproject scrapers inside the rosen project (directory).
Now, the items.py has the following code:
import scrapy
from decimal import Decimal

class someItem(scrapy.Item):
    title: str = scrapy.Field(serializer=str)
    bid: Decimal = scrapy.Field(serializer=Decimal)
And the exampleSpider.py has the following code:
from __future__ import absolute_import

import scrapy
from scrapy.loader import ItemLoader
from scrapers.scrapers.items import someItem

class someSpider(scrapy.Spider):
    name = "some"

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self._some_fields = someItem()

    def parse(self, response) -> None:
        some_loader = ItemLoader(item=self._some_fields, response=response)
        print(self._some_fields.keys())
The error I get is the following: runspider: error: Unable to load 'someSpider.py': No module named 'scrapers.scrapers'
I found "Scrapy: ImportError: No module named items" and tried all three solutions, renaming and adding from __future__ import absolute_import. Nothing helps. Please advise.
The command that I execute is scrapy runspider exampleSpider.py. I tried it from both the spiders and rosen directories.
I do not see any virtualenv inside your directory, so I recommend creating one, e.g. under 'rosen'.
You can try this:
try:
    from scrapers.items import someItem
except ImportError:
    from scrapers.scrapers.items import someItem
Then call it with:
scrapy crawl NameOfSpider
or:
scrapy runspider path/to/spider.py
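If you prefer a single import path, a hedged sketch of the usual fix is below: import from scrapers.items (not scrapers.scrapers.items) and launch the spider with scrapy crawl from the directory that contains scrapy.cfg, which scrapy startproject should have created under rosen/scrapers. The spider body is abbreviated and only illustrates the import:

# exampleSpider.py -- sketch; assumes it is launched via 'scrapy crawl some'
# from the project root (the directory holding scrapy.cfg)
import scrapy
from scrapy.loader import ItemLoader
from scrapers.items import someItem   # package path as seen from the project root

class someSpider(scrapy.Spider):
    name = "some"

    def parse(self, response):
        loader = ItemLoader(item=someItem(), response=response)
        print(someItem().keys())

When a Scrapy command is run from inside the project, the directory containing scrapy.cfg is put on sys.path, so scrapers.items should resolve while scrapers.scrapers.items does not.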
I want to test some methods in my spider. For example, my project has this layout:
toto/
├── __init__.py
├── items.py
├── pipelines.py
├── settings.py
├── spiders
│ ├── __init__.py
│ └── mySpider.py
└── Unitest
└── unitest.py
My unitest.py looks like this:
# -*- coding: utf-8 -*-
import re
import weakref
import six
import unittest

from scrapy.selector import Selector
from scrapy.crawler import Crawler
from scrapy.utils.project import get_project_settings
from unittest.case import TestCase
from toto.spiders import runSpider

class SelectorTestCase(unittest.TestCase):
    sscls = Selector

    def test_demo(self):
        print "test"

if __name__ == '__main__':
    unittest.main()
and my mySpider.py looks like this:
import scrapy

class runSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['http://blog.scrapinghub.com']

    def parse(self, response):
        for url in response.css('ul li a::attr("href")').re(r'.*/\d\d\d\d/\d\d/$'):
            yield scrapy.Request(response.urljoin(url), self.parse_titles)

    def parse_titles(self, response):
        for post_title in response.css('div.entries > ul > li a::text').extract():
            yield {'title': post_title}
How can I call my spider from my unitest.py file?
I tried adding from toto.spiders import runSpider to my unitest.py file, but it does not work...
I get this error:
Traceback (most recent call last):
  File "unitest.py", line 10, in <module>
    from toto.spiders import runSpider
ImportError: No module named toto.spiders
How can I fix it?
Try:
import sys
import os
sys.path.insert(0, os.path.join(os.path.dirname(os.path.realpath(__file__)), '../..'))  # 2 folders back from the current file
from spiders.mySpider import runSpider
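Once the import path is sorted out (for example with the sys.path tweak above, adjusted so that the directory containing the toto package ends up on sys.path), you can also unit-test a callback without crawling at all by handing it a canned response. This is just a sketch; the URL and HTML snippet are made up for illustration:

import unittest
from scrapy.http import HtmlResponse

from toto.spiders.mySpider import runSpider

class SpiderTestCase(unittest.TestCase):
    def test_parse_titles(self):
        spider = runSpider()
        # fake page matching the CSS selector used in parse_titles
        html = b'<div class="entries"><ul><li><a href="#">Hello</a></li></ul></div>'
        response = HtmlResponse(url='http://blog.scrapinghub.com/2016/01/',
                                body=html, encoding='utf-8')
        items = list(spider.parse_titles(response))
        self.assertEqual(items, [{'title': u'Hello'}])

if __name__ == '__main__':
    unittest.main()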
My file list:
.
|-- lf
| |-- __init__.py
| |-- __init__.pyc
| |-- items.py
| |-- items.pyc
| |-- pipelines.py
| |-- settings.py
| |-- settings.pyc
| `-- spiders
| |-- bbc.py
| |-- bbc.pyc
| |-- __init__.py
| |-- __init__.pyc
| |-- lwifi.py
| `-- lwifi.pyc
|-- scrapy.cfg
`-- script.py
items.py:
from scrapy.item import Item, Field

class LfItem(Item):
    topic = Field()
script.py:
from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy import log, signals
from lf.spiders.lwifi import LwifiSpider
from scrapy.utils.project import get_project_settings
spider = LwifiSpider(domain='Lifehacker.co.in')
settings = get_project_settings()
crawler = Crawler(settings)
crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start()
reactor.run()
lwifi.py:
from scrapy.spider import Spider
from scrapy.selector import Selector
from scrapy.http import Request

class LwifiSpider(Spider):
    name = "lwifi"

    def __init__(self, **kw):
        super(LwifiSpider, self).__init__(**kw)
        url = kw.get('url') or kw.get('domain') or 'lifehacker.co.in/others/Dont-Use-Personal-Information-in-Your-Wi-Fi-Network-Name/articleshow/45407704.cms'
        if not url.startswith('http://') and not url.startswith('https://'):
            url = 'http://%s/' % url
        self.url = url
        self.allowed_domains = ["lifehacker.co.in/others/Dont-Use-Personal-Information-in-Your-Wi-Fi-Network-Name/articleshow/45407704.cms"]

    def start_requests(self):
        return [Request(self.url, callback=self.parse)]

    def parse(self, response):
        topic = response.xpath("//h1/text()").extract()
        print topic
I am new to Python and Scrapy. As a start I wrote a simple Scrapy spider to run from a Python script (not using scrapinghub). My aim is to scrape the h1 from the page http://lifehacker.co.in/others/Dont-Use-Personal-Information-in-Your-Wi-Fi-Network-Name/articleshow/45407704.cms. The error is:
Traceback (most recent call last):
  File "script.py", line 4, in <module>
    from lf.spiders.lwifi import LwifiSpider
  File "/home/ajay/pythonpr/error/lf/lf/spiders/lwifi.py", line 7, in <module>
    class LwifiSpider(Spider):
  File "/home/ajay/pythonpr/error/lf/lf/spiders/lwifi.py", line 11, in LwifiSpider
    url = kw.get('url') or kw.get('domain') or 'lifehacker.co.in/others/Dont-Use-Personal-Information-in-Your-Wi-Fi-Network-Name/articleshow/45407704.cms'
NameError: name 'kw' is not defined
Please help.
If you look carefully at the traceback, you will see that the error occurs in the body of the LwifiSpider class:
File "/home/.../lwifi.py", line 11, in LwifiSpider
If the error occurred in the __init__ of that class you'd see a line like this instead:
File "/home/.../lwifi.py", line 11, in __init__
So it would appear that there is some kind of indentation error that is causing the problematic line to be outside of the __init__ method, where the kw argument cannot be seen.
Try re-indenting the whole of the __init__ function, and make sure you are not mixing tabs and spaces anywhere (any decent text editor should allow you to make all the whitespace visible).
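To illustrate, here is a hedged before/after sketch (not the asker's exact file) of the kind of slip the traceback points to:

from scrapy.spider import Spider

# Broken: the assignment runs while the class body is being defined,
# so the __init__ parameter 'kw' is not in scope.
# (Importing this class raises the NameError shown in the question.)
class BrokenSpider(Spider):
    name = "lwifi"
    def __init__(self, **kw):
        super(BrokenSpider, self).__init__(**kw)
    url = kw.get('url') or kw.get('domain')   # executed at class level -> NameError

# Fixed: the line is indented back inside __init__, where 'kw' exists
class FixedSpider(Spider):
    name = "lwifi"
    def __init__(self, **kw):
        super(FixedSpider, self).__init__(**kw)
        self.url = kw.get('url') or kw.get('domain')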
I am using Scrapy and it is great! It is so fast to build a crawler. As the number of web sites increases, I need to create new spiders, but these web sites are all of the same type;
all these spiders use the same items, pipelines, and parsing process.
The contents of the project directory:
test/
├── scrapy.cfg
└── test
├── __init__.py
├── items.py
├── mybasespider.py
├── pipelines.py
├── settings.py
├── spider1_settings.py
├── spider2_settings.py
└── spiders
├── __init__.py
├── spider1.py
└── spider2.py
To reduce source code redundancy, mybasespider.py has a base spider MyBaseSpider; 95% of the source code is in it, and all the other spiders inherit from it. If a spider needs something special, I override some class methods; normally I only need to add a few lines of source code to create a new spider.
I place all common settings in settings.py; one spider's special settings go in [spider name]_settings.py, such as:
The special settings of spider1 in spider1_settings.py:
from settings import *
LOG_FILE = 'spider1.log'
LOG_LEVEL = 'INFO'
JOBDIR = 'spider1-job'
START_URLS = [
'http://test1.com/',
]
The special settings of spider2 in spider2_settings.py:
from settings import *
LOG_FILE = 'spider2.log'
LOG_LEVEL = 'DEBUG'
JOBDIR = 'spider2-job'
START_URLS = [
'http://test2.com/',
]
Scrapy uses LOG_FILE, LOG_LEVEL, and JOBDIR before launching a spider.
All of the urls in START_URLS are filled into MyBaseSpider.start_urls; different spiders have different contents, but the name START_URLS used in the base spider MyBaseSpider doesn't change.
The contents of scrapy.cfg:
[settings]
default = test.settings
spider1 = spider1.settings
spider2 = spider2.settings
[deploy]
url = http://localhost:6800/
project = test
To run a spider, such as spider1:
export SCRAPY_PROJECT=spider1
scrapy crawl spider1
But this way can't be used to run spiders in scrapyd: the scrapyd-deploy command always uses the 'default' project name from the scrapy.cfg 'settings' section to build an egg file and deploy it to scrapyd.
I have several questions:
Is this the way to use multiple spiders in one project if I don't create a project per spider? Are there any better ways?
How can I separate a spider's special settings as above, so that they can run in scrapyd while still reducing source code redundancy?
If all spiders use the same JOBDIR, is it safe to run all spiders concurrently? Would the persistent spider state be corrupted?
Any insights would be greatly appreciated.
I don't know if this will answer your first question, but I use Scrapy with multiple spiders, and in the past I used the command
scrapy crawl spider1
but when I had more than one spider this command would activate other modules as well, so I started to use this command:
scrapy runspider <your full spider1 path with the spiderclass.py>
example: "scrapy runspider home/Documents/scrapyproject/scrapyproject/spiders/spider1.py"
I hope it will help :)
As all spiders should have their own class, you can set the settings per spider with the custom_settings class attribute, so something like:
class MySpider1(Spider):
    name = "spider1"
    custom_settings = {'USER_AGENT': 'user_agent_for_spider1/version1'}

class MySpider2(Spider):
    name = "spider2"
    custom_settings = {'USER_AGENT': 'user_agent_for_spider2/version2'}
These custom_settings will overwrite the ones in the settings.py file, so you can still set some global ones there.
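If you also want to keep the per-spider START_URLS idea from the question, one possible sketch (class and setting names follow the question; the self.settings access assumes a Scrapy version where it is available on the spider) is to let the base spider read them in start_requests:

import scrapy

class MyBaseSpider(scrapy.Spider):
    # shared parsing logic would live here (sketch only)
    def start_requests(self):
        # START_URLS is a custom setting; each subclass supplies its own
        for url in self.settings.getlist('START_URLS'):
            yield scrapy.Request(url, callback=self.parse)

class Spider1(MyBaseSpider):
    name = 'spider1'
    custom_settings = {
        'JOBDIR': 'spider1-job',
        'START_URLS': ['http://test1.com/'],
    }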
Good job! I didn't find a better way to manage multiple spiders in the documentation.
I don't know about scrapyd, but when running from the command line you should set the environment variable SCRAPY_PROJECT to the target project.
See scrapy/utils/project.py:
ENVVAR = 'SCRAPY_SETTINGS_MODULE'
...

def get_project_settings():
    if ENVVAR not in os.environ:
        project = os.environ.get('SCRAPY_PROJECT', 'default')
        init_env(project)
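Based on that snippet, a possible sketch of picking a settings alias programmatically (reusing the spider1 alias from the question's scrapy.cfg, and assuming the script is run from inside the project so scrapy.cfg can be found and the alias points at an importable settings module):

import os

# must be set before the project settings are loaded
os.environ['SCRAPY_PROJECT'] = 'spider1'

from scrapy.utils.project import get_project_settings

settings = get_project_settings()
print(settings.get('JOBDIR'))   # should reflect spider1's settings module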
I have a very basic spider, following the instructions in the getting started guide, but for some reason trying to import my items into my spider returns an error. The spider and items code is shown below:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from myProject.items import item

class MyProject(BaseSpider):
    name = "spider"
    allowed_domains = ["website.com"]
    start_urls = [
        "website.com/start"
    ]

    def parse(self, response):
        print response.body
from scrapy.item import Item, Field

class ProjectItem(Item):
    title = Field()
When I run this code Scrapy either can't find my spider or can't import my items file. What's going on here? This should be a really simple example to run, right?
I also ran into this several times while working with Scrapy. You can add this line at the beginning of your Python modules:
from __future__ import absolute_import
More info here:
http://www.python.org/dev/peps/pep-0328/#rationale-for-absolute-imports
http://pythonquirks.blogspot.ru/2010/07/absolutely-relative-import.html
You are importing a field; you must import a class from items.py, like from myproject.items import class_name.
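For the code in this question that means importing the ProjectItem class defined in items.py. A sketch of the corrected spider (the XPath and the populated value are only illustrative):

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from myProject.items import ProjectItem   # import the Item subclass, not a field

class MyProject(BaseSpider):
    name = "spider"
    allowed_domains = ["website.com"]
    start_urls = ["http://website.com/start"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        item = ProjectItem()
        item['title'] = hxs.select('//title/text()').extract()
        return item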
So, this was a problem that I came across the other day and was able to fix through some trial and error, but I wasn't able to find any documentation of it, so I thought I'd put this up in case anyone happens to run into the same problem.
This isn't so much an issue with Scrapy as it is an issue with naming files and how Python deals with importing modules. Basically, the problem is that if you name your spider file the same thing as the project, your imports are going to break. Python will try to import from the directory closest to your current position, which means it's going to try to import from the spider's directory, which isn't going to work.
Basically just change the name of your spider file to something else and it'll all be up and running just fine.
If the structure is like this:
package/
    __init__.py
    subpackage1/
        __init__.py
        moduleX.py
        moduleY.py
    subpackage2/
        __init__.py
        moduleZ.py
    moduleA.py
and you are in moduleX.py, the way to import the other modules is:
from .moduleY import *
from ..moduleA import *
from ..subpackage2.moduleZ import *
Refer to PEP 328: Imports: Multi-Line and Absolute/Relative.