I have a very basic spider, following the instructions in the getting started guide, but for some reason, trying to import my items into my spider returns an error. Spider and items code is shown below:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from myProject.items import item

class MyProject(BaseSpider):
    name = "spider"
    allowed_domains = ["website.com"]
    start_urls = [
        "website.com/start"
    ]

    def parse(self, response):
        print response.body
from scrapy.item import Item, Field

class ProjectItem(Item):
    title = Field()
When I run this code, scrapy either can't find my spider or can't import my items file. What's going on here? This should be a really simple example to get running, right?
I ran into this several times while working with scrapy. You can add this line at the beginning of your Python modules:
from __future__ import absolute_import
More info here:
http://www.python.org/dev/peps/pep-0328/#rationale-for-absolute-imports
http://pythonquirks.blogspot.ru/2010/07/absolutely-relative-import.html
You are importing a field; you must import the item class from items.py, like from myproject.items import ClassName.
So, this was a problem that I came across the other day that I was able to fix through some trial and error, but I wasn't able to find any documentation of it so I thought I'd put this up in case anyone happens to run into the same problem I did.
This isn't so much an issue with scrapy as an issue with file naming and how Python imports modules. Basically, if you name your spider file the same as the project, your imports will break: Python tries to import from the directory closest to your current position, which means it looks in the spider's directory, where the import can't be resolved.
Basically just change the name of your spider file to something else and it'll all be up and running just fine.
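The shadowing described above can be reproduced in a few lines. This is a sketch with a hypothetical module name (a file called json.py standing in for a spider file that shares a name with another module): once the directory containing the impostor comes first on sys.path, Python resolves the import to it instead of the standard library.

```python
import importlib
import os
import sys
import tempfile

# Sketch: a local file named like another module shadows it once its
# directory comes first on sys.path (hypothetical file name).
tmp = tempfile.mkdtemp()
with open(os.path.join(tmp, "json.py"), "w") as f:
    f.write("VALUE = 'shadowed'\n")

sys.modules.pop("json", None)   # forget any previously imported json
sys.path.insert(0, tmp)         # the impostor's directory wins the lookup
importlib.invalidate_caches()

import json                     # picks up tmp/json.py, not the stdlib
shadowed_value = json.VALUE     # -> 'shadowed'
print(shadowed_value)

# Undo the shadowing -- renaming the file, as the answer suggests,
# amounts to the same thing:
sys.path.remove(tmp)
sys.modules.pop("json")
```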
If the structure is like this:
package/
__init__.py
subpackage1/
__init__.py
moduleX.py
moduleY.py
subpackage2/
__init__.py
moduleZ.py
moduleA.py
and you are in moduleX.py, the ways to import the other modules are (note: no .py suffix in an import statement):
from .moduleY import *
from ..moduleA import *
from ..subpackage2.moduleZ import *
Refer to PEP 328 – Imports: Multi-Line and Absolute/Relative.
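As a quick check of the corrected syntax, the tree above can be built on disk and one of the relative imports exercised. This is a sketch; the module contents (SPAM = 42) are made up for the demonstration.

```python
import importlib
import os
import sys
import tempfile

# Sketch: recreate package/subpackage1 from the answer and exercise
# 'from .moduleY import ...' (note: no .py suffix in the import).
root = tempfile.mkdtemp()
sub = os.path.join(root, "package", "subpackage1")
os.makedirs(sub)
open(os.path.join(root, "package", "__init__.py"), "w").close()
open(os.path.join(sub, "__init__.py"), "w").close()
with open(os.path.join(sub, "moduleY.py"), "w") as f:
    f.write("SPAM = 42\n")
with open(os.path.join(sub, "moduleX.py"), "w") as f:
    f.write("from .moduleY import SPAM\n")

sys.path.insert(0, root)
importlib.invalidate_caches()
moduleX = importlib.import_module("package.subpackage1.moduleX")
print(moduleX.SPAM)  # -> 42
```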
I cannot load scrapy items into scrapy spiders. Here is my project structure:

rosen/
    log/
    scrapers/
        scrapers/
            spiders/
                __init__.py
                exampleSpider.py
            __init__.py
            items.py
            middlewares.py
            pipelines.py
            settings.py
    src/
        __init__.py
        otherStuff.py
    tmp/
This structure has been created using scrapy startproject scrapers inside of rosen project (directory).
Now, items.py has the following code:

import scrapy
from decimal import Decimal

class someItem(scrapy.Item):
    title: str = scrapy.Field(serializer=str)
    bid: Decimal = scrapy.Field(serializer=Decimal)
And exampleSpider.py has the following code:

from __future__ import absolute_import

import scrapy
from scrapy.loader import ItemLoader
from scrapers.scrapers.items import someItem

class someSpider(scrapy.Spider):
    name = "some"

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self._some_fields = someItem()

    def parse(self, response) -> None:
        some_loader = ItemLoader(item=self._some_fields, response=response)
        print(self._some_fields.keys())
The error I get is the following: runspider: error: Unable to load 'someSpider.py': No module named 'scrapers.scrapers'
I found Scrapy: ImportError: No module named items and tried all three solutions, renaming and adding from __future__ import absolute_import. Nothing helps. Please advise.
The command that I execute is scrapy runspider exampleSpider.py. I tried it from the spiders and rosen directories.
I do not see any virtualenv inside your directory, so I recommend creating one, e.g. under rosen.
You can try this:

try:
    from scrapers.items import someItem
except ImportError:
    from scrapers.scrapers.items import someItem

then call it with:

scrapy crawl NameOfSpider

or:

scrapy runspider path/to/spider.py
I've been trying to write a Python file to scrape the whole content of a page of a website. Now, everything seems to be fine in my code, until I run it.
I've made sure to link the items from the items python file. I shouldn't get any errors, yet I keep getting "ValueError: attempted relative import beyond top-level package".
Here is my code from my main python file:
import scrapy
from ..items import AnalogicScrapeItem

class AnalogicSpider(scrapy.Spider):
    name = 'analogic'
    start_urls = ['https://www.analogic.com/about/']

    def parse(self, response):
        items = AnalogicScrapeItem()

        body1 = response.css('body').css('::text').extract()
        items['body1'] = body1

        yield items
Here is my code from items.py file:
import scrapy

class AnalogicScrapeItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    body1 = scrapy.Field()
After running the code, here is the error I get:
Traceback (most recent call last):
  File "C:/Users/Kev/PycharmProjects/whole_page_extract3/analogic_scrape/analogic_scrape/spiders/analogic.py", line 3, in <module>
    from ..items import AnalogicScrapeItem
ValueError: attempted relative import beyond top-level package
Any help resolving this issue would be greatly appreciated, thank you!
from analogic_scrape.items import AnalogicScrapeItem
would do the job. When you use .., you are importing relative to the current package, and that fails here because the file is being treated as top-level.
However, if you run the script from command line with scrapy crawl analogic, relative imports are not a problem.
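Why the error happens can be seen with importlib.util.resolve_name, which performs the same resolution as from ..items import ...: the relative name resolves fine inside the package, but not when the file runs as a top-level script and __package__ is empty. This is a sketch; the package name follows the question's layout.

```python
import importlib.util

# Inside the package, '..items' resolves against __package__:
name = importlib.util.resolve_name("..items", "analogic_scrape.spiders")
print(name)  # -> analogic_scrape.items

# Run as a plain script, __package__ is empty and resolution fails
# (ImportError on Python 3.9+, ValueError on older versions):
try:
    importlib.util.resolve_name("..items", "")
except (ImportError, ValueError) as exc:
    print("relative import fails:", exc)
```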
I've been trying to import some python classes which are defined in a child directory. The directory structure is as follows:
workspace/
__init__.py
main.py
checker/
__init__.py
baseChecker.py
gChecker.py
The baseChecker.py looks similar to:
import urllib

class BaseChecker(object):
    # SOME METHODS HERE
The gChecker.py file:
import baseChecker  # should import baseChecker.py

class GChecker(BaseChecker):  # gives a TypeError: Error when calling the metaclass bases
    # SOME METHODS WHICH USE URLLIB
And finally the main.py file:
import ?????

gChecker = GChecker()
gChecker.someStuff()  # which uses urllib
My intention is to be able to run the main.py file and instantiate the classes under the checker/ directory. But I would like to avoid importing urllib in each file (if possible).
Note that both the __init__.py are empty files.
I have already tried from checker.gChecker import GChecker in main.py, but an ImportError: No module named checker.gChecker shows.
In the posted code, in gChecker.py, you need to do
from baseChecker import BaseChecker
instead of import baseChecker
Otherwise you get
NameError: name 'BaseChecker' is not defined
Also, with the mentioned folder structure, you don't need the checker module to be on PYTHONPATH for it to be visible from main.py.
Then in main.py you can do:

from checker.gChecker import GChecker
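On Python 3 the implicit form no longer works at all; inside the package the import must be explicitly relative (from .baseChecker import BaseChecker). Below is a runnable sketch of the whole layout, with stub class bodies standing in for the real methods.

```python
import importlib
import os
import sys
import tempfile

# Sketch: rebuild workspace/checker with an explicit relative import,
# which is required on Python 3 (stub classes stand in for the real ones).
root = tempfile.mkdtemp()
checker = os.path.join(root, "checker")
os.makedirs(checker)
open(os.path.join(checker, "__init__.py"), "w").close()
with open(os.path.join(checker, "baseChecker.py"), "w") as f:
    f.write("class BaseChecker(object):\n    pass\n")
with open(os.path.join(checker, "gChecker.py"), "w") as f:
    f.write(
        "from .baseChecker import BaseChecker\n"
        "class GChecker(BaseChecker):\n    pass\n"
    )

sys.path.insert(0, root)           # plays the role of running from workspace/
importlib.invalidate_caches()
gChecker = importlib.import_module("checker.gChecker")
print(gChecker.GChecker.__name__)  # -> GChecker
```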
I want to test my scrapy spider. I want to import the spider into a test file, make a test spider, and override start_urls, but I have a problem with importing it. Here is the project structure:
...product-scraper\test_spider.py
...product-scraper\oxygen\oxygen\spiders\oxygen_spider.py
...product-scraper\oxygen\oxygen\items.py
the problem is that spider import Product class from items.py
from oxygen.items import Product
ImportError: No module named items
The command scrapy crawl oxygen_spider works.
I tried changing sys.path or using site.addsitedir in all possible ways:

basedir = os.path.abspath(os.path.dirname(__file__))
module_path = os.path.join(basedir, "oxygen\\oxygen")
sys.path.append(basedir)  # also tried module_path

No success :(
I use python 2.7 on windows
Do you really get the error "No module named items"? Or is it something like "No module named oxygen.items"?
Also I'm not really sure why you would want to use os.path commands. Wouldn't this just work:
from items import Product
So without the "oxygen." prefix. This would, however, as far as I know, only work if Product is a class in your items.py. If it is not a class, I would suggest just using:
import items
If that does not work, please specify what Product is in your items.py
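One way to make from oxygen.items import Product resolve from test_spider.py is to put the outer oxygen directory (the one that contains the package) on sys.path, rather than the project root. A sketch with a stubbed items.py:

```python
import importlib
import os
import sys
import tempfile

# Sketch: recreate product-scraper/oxygen/oxygen and show that the outer
# 'oxygen' directory is what belongs on sys.path (Product is a stub).
root = tempfile.mkdtemp()                       # plays product-scraper
inner = os.path.join(root, "oxygen", "oxygen")  # the package itself
os.makedirs(inner)
open(os.path.join(inner, "__init__.py"), "w").close()
with open(os.path.join(inner, "items.py"), "w") as f:
    f.write("class Product(object):\n    pass\n")

sys.path.insert(0, os.path.join(root, "oxygen"))  # outer dir, not root
importlib.invalidate_caches()
items = importlib.import_module("oxygen.items")
print(items.Product.__name__)  # -> Product
```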
I'm having an issue with the import statement in Python 3. I'm following a book (Python 3 Object Oriented) and have the following structure:
parent_directory/
main.py
ecommerce/
__init__.py
database.py
products.py
payments/
__init__.py
paypal.py
authorizenet.py
In paypal.py, I'm trying to use the Database class from database.py. So I tried this:
from ecommerce.database import Database
I get this error:
ImportError: No module named 'ecommerce'
so I try with both of these import statements:
from .ecommerce.database import Database
from ..ecommerce.database import Database
and I get this error:
SystemError: Parent module '' not loaded, cannot perform relative import
What am I doing wrong or missing?
Thank you for your time!
Add your parent_directory to Python's search path, for example:
import sys
sys.path.append('/full/path/to/parent_directory')
Alternatively, you can add parent_directory to the environmental variable PYTHONPATH.
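If hard-coding the absolute path is undesirable, a common variant is to compute it from the running file. This is a sketch that assumes the snippet lives in a file sitting directly inside parent_directory (e.g. main.py):

```python
import os
import sys

# Sketch: append the directory containing this file instead of a
# hard-coded absolute path (assumes this file sits in parent_directory).
parent_directory = os.path.dirname(os.path.abspath(__file__))
if parent_directory not in sys.path:
    sys.path.append(parent_directory)

# from ecommerce.database import Database  # now resolvable
```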