I've been trying to write a Python script that scrapes the whole content of a page of a website. Everything seems fine in my code until I run it.
I've made sure to import the item class from the items Python file, so I shouldn't get any errors, yet I keep getting "ValueError: attempted relative import beyond top-level package".
Here is the code from my main Python file:
import scrapy
from ..items import AnalogicScrapeItem


class AnalogicSpider(scrapy.Spider):
    name = 'analogic'
    start_urls = ['https://www.analogic.com/about/']

    def parse(self, response):
        items = AnalogicScrapeItem()
        body1 = response.css('body').css('::text').extract()
        items['body1'] = body1
        yield items
Here is my code from items.py file:
import scrapy


class AnalogicScrapeItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    body1 = scrapy.Field()
After running the code, here is the error I get:
Traceback (most recent call last):
  File "C:/Users/Kev/PycharmProjects/whole_page_extract3/analogic_scrape/analogic_scrape/spiders/analogic.py", line 3, in <module>
    from ..items import AnalogicScrapeItem
ValueError: attempted relative import beyond top-level package
Any help resolving this issue would be greatly appreciated, thank you!
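from analogic_scrape.items import AnalogicScrapeItem
should do the job. When you use .., the import is resolved relative to the package that contains the current module, so it only works when Python loads the file as part of that package. Running the file directly from the IDE does not do that.
However, if you run the spider from the command line with scrapy crawl analogic, relative imports are not a problem.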
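To make the difference concrete, here is a minimal sketch (not the asker's project) that builds a throwaway package in a temporary directory and imports the same spider module in two ways. The names demo_scrape, DemoItem and demo.py are hypothetical stand-ins. It assumes Python 3; older versions raise ValueError and newer ones ImportError for the first case, so both are caught.

```python
import importlib
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical layout: demo_scrape/items.py and demo_scrape/spiders/demo.py
    project = Path(tmp) / "demo_scrape"
    (project / "spiders").mkdir(parents=True)
    (project / "__init__.py").write_text("")
    (project / "items.py").write_text("class DemoItem:\n    pass\n")
    (project / "spiders" / "__init__.py").write_text("")
    (project / "spiders" / "demo.py").write_text("from ..items import DemoItem\n")

    # Case 1: only the package directory itself is on sys.path, so "spiders"
    # becomes the top-level package and ".." has nowhere to go.
    sys.path.insert(0, str(project))
    importlib.invalidate_caches()
    try:
        importlib.import_module("spiders.demo")
        error_message = None
    except (ImportError, ValueError) as exc:
        error_message = str(exc)
    finally:
        sys.path.remove(str(project))
        sys.modules.pop("spiders.demo", None)
        sys.modules.pop("spiders", None)

    # Case 2: the directory that contains the package is on sys.path, so the
    # module is loaded as demo_scrape.spiders.demo and ".." resolves.
    sys.path.insert(0, tmp)
    importlib.invalidate_caches()
    try:
        spider_module = importlib.import_module("demo_scrape.spiders.demo")
    finally:
        sys.path.remove(tmp)

print(error_message)
print(spider_module.DemoItem)
```

The second case is roughly what happens when the spider is loaded through the project package rather than as a loose file.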
Related
I have about a dozen python module imports that are going to be reused on many different scrapers, and I would love to just throw them into a single file (scraper_functions.py) that also contains a bunch of functions, like this:
import smtplib
import requests
import re
from urllib.request import urlopen
from bs4 import BeautifulSoup
import time
def function_name(var1):
    # function code here
    pass
then in my scraper I would simply do something like:
import scraper_functions
and be done with it. But listing the imports at the top of scraper_functions.py doesn't work, and neither does putting all the imports in a function. In each case I get errors in the scraper that is doing the importing.
Traceback (most recent call last):
File "{actual-scraper-name-here}.py", line 24, in <module>
x = requests.get(main_url)
NameError: name 'requests' is not defined
In addition, in VSCode, under Problems, I get errors like
Undefined variable 'requests' pylint(undefined-variable) [24,5]
None of the modules are recognized. I have made sure that all files are in the same directory.
Is such a thing possible please?
You need to either prefix the names with scraper_functions (the module name you imported) or use the from keyword with the * selector to pull everything from scraper_functions into your scraper's namespace.
Using the from keyword (Recommended)
from scraper_functions import * # import everything with *
...
x = requests.get(main_url)
Using the scraper_functions prefix (Not recommended)
import scraper_functions
...
x = scraper_functions.requests.get(main_url)
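A small sketch of why both forms work: names imported at the top of a module become attributes of that module, not of whoever imports it. To keep the example self-contained, it builds a stand-in scraper_functions module in memory instead of reading a file, and slugify is a hypothetical helper that is not part of the question. Importing specific names, as in the last import line, is a third option that avoids the wildcard.

```python
import sys
import types

# Stand-in for scraper_functions.py, built in memory so the sketch runs as-is.
helper_source = r'''
import re
import time

def slugify(text):
    # Hypothetical helper: lowercase the text and join the words with dashes.
    return re.sub(r"\W+", "-", text).strip("-").lower()
'''

helper = types.ModuleType("scraper_functions")
exec(helper_source, helper.__dict__)
sys.modules["scraper_functions"] = helper

# Form 1: plain import, every name needs the module prefix.
import scraper_functions

prefixed = scraper_functions.slugify("Hello, World!")
has_time = hasattr(scraper_functions, "time")  # the helper's own import of time

# Form 2: import specific names into the current namespace.
from scraper_functions import slugify, re as helper_re

direct = slugify("Hello, World!")
digits = helper_re.findall(r"\d+", "page 12 of 40")

print(prefixed, direct, has_time, digits)
```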
I'm trying to import a class from a file in the spiders folder, but it gives me an error.
I used the following statement to import the class:
from .spiders.amazon_Spider import amazonSpider
This is my file layout:
-amazonWebScraping
  -amazonWebScraping
    -spiders
      -amazon_spiders.py
    -scraper.py
I'm trying to access class amazonSpider(scrapy.Spider) from the amazon_spider.py file in the scraper.py file, but it gives me this error:
ImportError: attempted relative import with no known parent package
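This error usually means the file containing the relative import was started directly as a script, so Python does not know which package it belongs to. The sketch below reproduces that in a temporary directory with hypothetical names (demo_pkg, demo_spider.py, DemoSpider) and compares running the file as a script with running it as a module via python -m. It assumes Python 3.6 or newer, where the message has this wording.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical layout: demo_pkg/scraper.py next to demo_pkg/spiders/
    pkg = Path(tmp) / "demo_pkg"
    (pkg / "spiders").mkdir(parents=True)
    (pkg / "__init__.py").write_text("")
    (pkg / "spiders" / "__init__.py").write_text("")
    (pkg / "spiders" / "demo_spider.py").write_text(
        "class DemoSpider:\n    name = 'demo'\n"
    )
    (pkg / "scraper.py").write_text(
        "from .spiders.demo_spider import DemoSpider\nprint(DemoSpider.name)\n"
    )

    # Run the file directly: it becomes __main__ with no parent package.
    as_script = subprocess.run(
        [sys.executable, str(pkg / "scraper.py")],
        capture_output=True, text=True,
    )

    # Run it as a module from the directory that contains demo_pkg.
    as_module = subprocess.run(
        [sys.executable, "-m", "demo_pkg.scraper"],
        cwd=tmp, capture_output=True, text=True,
    )

print(as_script.stderr.strip().splitlines()[-1])
print(as_module.stdout.strip())
```

An absolute import (here it would be from demo_pkg.spiders.demo_spider import DemoSpider) combined with running from the directory that contains the package is the other common way around it.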
I cannot load scrapy items to scrapy spiders. Here is my project structure:
-rosen
.log
.scrapers
..scrapers
...spiders
....__init__.py
....exampleSpider.py
...__init__.py
...items.py
...middlewares.py
...pipelines.py
...settings.py
.src
..__init__.py
..otherStuff.py
.tmp
This structure has been created using scrapy startproject scrapers inside of rosen project (directory).
Now, the items.py has the following code:
import scrapy
from decimal import Decimal


class someItem(scrapy.Item):
    title: str = scrapy.Field(serializer=str)
    bid: Decimal = scrapy.Field(serializer=Decimal)
And the exampleSpider.py has the following code:
from __future__ import absolute_import  # must be the first statement in the file

import scrapy
from scrapy.loader import ItemLoader
from scrapers.scrapers.items import someItem


class someSpider(scrapy.Spider):
    name = "some"

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self._some_fields = someItem()

    def parse(self, response) -> None:
        some_loader = ItemLoader(item=self._some_fields, response=response)
        print(self._some_fields.keys())
The error I get is the following: runspider: error: Unable to load 'someSpider.py': No module named 'scrapers.scrapers'
I found "Scrapy: ImportError: No module named items" and tried all three solutions there, including renaming and adding from __future__ import absolute_import. Nothing helps. Please advise.
The command that I execute is scrapy runspider exampleSpider.py. I tried it from both the spiders and rosen directories.
I do not see a virtualenv inside your directory, so I recommend creating one, e.g. under 'rosen'.
You can try this:
try:
    from scrapers.items import someItem
except ImportError:
    from scrapers.scrapers.items import someItem
Then call it with:
scrapy crawl NameOfSpider
or:
scrapy runspider path/to/spider.py
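One detail about this kind of fallback: a failed import raises ImportError (or its subclass ModuleNotFoundError), not FileNotFoundError, so that is the exception the except clause has to name. The sketch below shows the pattern with a hypothetical helper, import_first_available, and uses the standard library json module as the fallback so that it runs anywhere.

```python
import importlib


def import_first_available(*module_names):
    """Return the first module in module_names that imports successfully."""
    last_error = None
    for name in module_names:
        try:
            return importlib.import_module(name)
        except ImportError as exc:  # also covers ModuleNotFoundError
            last_error = exc
    raise ImportError(f"none of {module_names} could be imported") from last_error


# The first name is deliberately one that should not exist.
found = import_first_available("hypothetical_missing_module_xyz", "json")

print(found.__name__)
print(issubclass(ModuleNotFoundError, ImportError))
print(issubclass(ModuleNotFoundError, FileNotFoundError))
```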
I want to test my scrapy spider. I want to import the spider into a test file, make a test spider, and override start_urls, but I have a problem importing it. Here is the project structure:
...product-scraper\test_spider.py
...product-scraper\oxygen\oxygen\spiders\oxygen_spider.py
...product-scraper\oxygen\oxygen\items.py
The problem is that the spider imports the Product class from items.py:
from oxygen.items import Product
ImportError: No module named items
From cmd, scrapy crawl oxygen_spider works.
I tried changing sys.path and using site.addsitedir in every way I could think of:
basedir = os.path.abspath(os.path.dirname(__file__))
module_path = os.path.join(basedir, "oxygen\\oxygen")
sys.path.append(basedir) # module_path
No success :(
I use Python 2.7 on Windows.
Do you really get the error "No module named items"? Or is it something like "No module named oxygen.items"?
Also I'm not really sure why you would want to use os.path commands. Wouldn't this just work:
from items import Product
So without the "oxygen." This would however, as far as I know, only work if Product is a class in your items.py. If it's not a class I would suggest to just use:
import items
If that does not work, please specify what Product is in your items.py
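For the sys.path route, the directory that has to be on sys.path is the one that contains the package, not the package directory itself and not the folder above the project. Here is a minimal sketch, assuming Python 3 (the question uses 2.7, where pathlib and tempfile.TemporaryDirectory are not available) and rebuilding a similar oxygen/oxygen/items.py layout in a temporary directory:

```python
import importlib
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as basedir:
    # Recreate <basedir>/oxygen/oxygen/items.py, mirroring the question.
    project_root = Path(basedir) / "oxygen"   # outer project folder
    package_dir = project_root / "oxygen"     # the importable package
    package_dir.mkdir(parents=True)
    (package_dir / "__init__.py").write_text("")
    (package_dir / "items.py").write_text("class Product:\n    pass\n")

    # Add the folder that CONTAINS the package, so "oxygen.items" resolves.
    sys.path.insert(0, str(project_root))
    importlib.invalidate_caches()
    try:
        items = importlib.import_module("oxygen.items")
    finally:
        sys.path.remove(str(project_root))

product_cls = items.Product
print(product_cls.__module__, product_cls.__name__)
```

In the question's terms, that would correspond to appending os.path.join(basedir, "oxygen") rather than basedir or the inner oxygen\oxygen folder, assuming the inner folder has an __init__.py.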
I have a very basic spider, following the instructions in the getting started guide, but for some reason, trying to import my items into my spider returns an error. Spider and items code is shown below:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from myProject.items import item


class MyProject(BaseSpider):
    name = "spider"
    allowed_domains = ["website.com"]
    start_urls = [
        "website.com/start"
    ]

    def parse(self, response):
        print response.body

from scrapy.item import Item, Field


class ProjectItem(Item):
    title = Field()
When I run this code, scrapy either can't find my spider or can't import my items file. What's going on here? This should be a really simple example to run, right?
I also had this several times while working with scrapy. You could add at the beginning of your Python modules this line:
from __future__ import absolute_import
More info here:
http://www.python.org/dev/peps/pep-0328/#rationale-for-absolute-imports
http://pythonquirks.blogspot.ru/2010/07/absolutely-relative-import.html
You are importing a name (item) that items.py does not define. You must import the class that items.py defines, like:
from myProject.items import ProjectItem
So, this was a problem that I came across the other day that I was able to fix through some trial and error, but I wasn't able to find any documentation of it so I thought I'd put this up in case anyone happens to run into the same problem I did.
This isn't so much an issue with scrapy as it is an issue with naming files and how python deals with importing modules. Basically the problem is that if you name your spider file the same thing as the project then your imports are going to break. Python will try to import from the directory closest to your current position which means it's going to try to import from the spider's directory which isn't going to work.
Basically just change the name of your spider file to something else and it'll all be up and running just fine.
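The name clash can be reproduced in isolation. This sketch assumes Python 3 and forces the spiders directory to the front of sys.path to imitate the lookup order described above; demo_project and its files are hypothetical. The spider file named after the project is found first, and since it is a plain module rather than a package, the import of its items submodule fails.

```python
import importlib
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    package = Path(tmp) / "demo_project"
    spiders = package / "spiders"
    spiders.mkdir(parents=True)
    (package / "__init__.py").write_text("")
    (package / "items.py").write_text("class ProjectItem:\n    pass\n")
    # Spider file deliberately named like the project itself.
    (spiders / "demo_project.py").write_text("")

    # Put the spiders directory ahead of the real package location.
    sys.path[:0] = [str(spiders), tmp]
    importlib.invalidate_caches()
    try:
        importlib.import_module("demo_project.items")
        shadow_error = None
    except ImportError as exc:
        shadow_error = str(exc)
    finally:
        sys.path.remove(str(spiders))
        sys.path.remove(tmp)
        sys.modules.pop("demo_project", None)

print(shadow_error)
```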
If the structure is like this:
package/
    __init__.py
    subpackage1/
        __init__.py
        moduleX.py
        moduleY.py
    subpackage2/
        __init__.py
        moduleZ.py
    moduleA.py
and you are in moduleX.py, the other modules can be imported like this (note that the .py extension is not part of the module name):
from .moduleY import *
from ..moduleA import *
from ..subpackage2.moduleZ import *
Reference: PEP 328, Imports: Multi-Line and Absolute/Relative
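As a quick check, the layout above can be rebuilt in a temporary directory and moduleX imported through its full package path. This is only a sketch: the module contents (the A, Y and Z variables) are made up for illustration, and it imports explicit names instead of * so the result is easy to inspect.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# File contents are hypothetical; only the layout follows the example above.
FILES = {
    "package/__init__.py": "",
    "package/moduleA.py": "A = 'from moduleA'\n",
    "package/subpackage1/__init__.py": "",
    "package/subpackage1/moduleY.py": "Y = 'from moduleY'\n",
    "package/subpackage1/moduleX.py": (
        "from .moduleY import Y\n"
        "from ..moduleA import A\n"
        "from ..subpackage2.moduleZ import Z\n"
    ),
    "package/subpackage2/__init__.py": "",
    "package/subpackage2/moduleZ.py": "Z = 'from moduleZ'\n",
}

with tempfile.TemporaryDirectory() as root:
    for relative_path, source in FILES.items():
        path = Path(root) / relative_path
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(source)

    # The directory containing "package" must be importable.
    sys.path.insert(0, root)
    importlib.invalidate_caches()
    try:
        module_x = importlib.import_module("package.subpackage1.moduleX")
    finally:
        sys.path.remove(root)

print(module_x.Y, module_x.A, module_x.Z)
```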