Scraping a large number of static html.gz files in Scrapy - Python

I have a Scrapy spider that reads static HTML files from disk, using file:/// URLs as start URLs, but I can't load the gzipped files and loop through my directory of 150,000 files, which all have the .html.gz suffix. I've tried several different approaches (commented out below), but nothing has worked so far. My code currently looks like this:
from scrapy.spiders import CrawlSpider
from Scrapy_new.items import Scrapy_newTestItem
import gzip
import glob
import os.path

class Scrapy_newSpider(CrawlSpider):
    name = "info_extract"
    source_dir = '/path/to/file/'
    allowed_domains = []
    start_urls = ['file://///path/to/files/.*html.gz']

    def parse_item(self, response):
        item = Scrapy_newTestItem()
        item['user'] = response.xpath('//*[@id="page-user"]/div[1]/div/div/div[2]/div/div[2]/div[1]/h1/span[2]/text()').extract()
        item['list_of_links'] = response.xpath('//*[@id="page-user"]/div[1]/div/div/div[2]/div/div[3]/div[3]/a/@href').extract()
        item['list_of_text'] = response.xpath('//*[@id="page-user"]/div[1]/div/div/div/div/div/div/a/text()').extract()
Running this gives the following error:
Traceback (most recent call last):
File "/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 150, in maybeDeferred
result = f(*args, **kw)
File "/usr/local/lib/python2.7/site-packages/scrapy/core/downloader/handlers/file.py", line 13, in download_request
with open(filepath, 'rb') as fo:
IOError: [Errno 2] No such file or directory: 'path/to/files/*.html'
I changed my code so that the files are first unzipped and their contents passed through, as follows:
source_dir = 'path/to/files/'
for src_name in glob.glob(os.path.join(source_dir, '*.gz')):
    base = os.path.basename(src_name)
    with gzip.open(src_name, 'rb') as infile:
        #start_urls = ['/path/to/files*.html']#
        file_cont = infile.read()
        start_urls = file_cont#['file:////file_cont']
Gives the following error:
Traceback (most recent call last):
File "/usr/local/lib/python2.7/site-packages/scrapy/core/engine.py", line 127, in _next_request
request = next(slot.start_requests)
File "/usr/local/lib/python2.7/site-packages/scrapy/spiders/__init__.py", line 70, in start_requests
yield self.make_requests_from_url(url)
File "/usr/local/lib/python2.7/site-packages/scrapy/spiders/__init__.py", line 73, in make_requests_from_url
return Request(url, dont_filter=True)
File "/usr/local/lib/python2.7/site-packages/scrapy/http/request/__init__.py", line 25, in __init__
self._set_url(url)
File "/usr/local/lib/python2.7/site-packages/scrapy/http/request/__init__.py", line 57, in _set_url
raise ValueError('Missing scheme in request url: %s' % self._url)
ValueError: Missing scheme in request url: %3C

You don't always have to use start_urls in a Scrapy spider. Also, CrawlSpider is commonly used together with rules that specify which routes to follow and what to extract when crawling big sites, so you may want to use scrapy.Spider directly instead of CrawlSpider.
Now, the solution relies on the start_requests method that a Scrapy spider offers, which handles the spider's first requests. If this method is implemented in your spider, start_urls won't be used:
from scrapy import Spider
import gzip
import glob
import os

class ExampleSpider(Spider):
    name = 'info_extract'

    def start_requests(self):
        os.chdir("/path/to/files")
        for file_name in glob.glob("*.html.gz"):
            f = gzip.open(file_name, 'rb')
            file_content = f.read()
            print file_content # now you are reading the file content of your local files
Now, remember that start_requests must return an iterable of requests, which isn't the case here, because you are only reading files (I assume you are going to create requests later with the content of those files), so my code will fail with something like:
CRITICAL:
Traceback (most recent call last):
...
/.../scrapy/crawler.py", line 73, in crawl
start_requests = iter(self.spider.start_requests())
TypeError: 'NoneType' object is not iterable
This points out that I am not returning anything from my start_requests method (it returns None), which isn't iterable.

Scrapy will not be able to deal with the compressed HTML files; you have to extract them first. This can be done on the fly in Python, or you can simply extract them at the operating-system level.
Related: Python Scrapy on offline (local) data
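Putting the two answers together, here is a minimal sketch of the on-the-fly approach, reusing the spider name from the question. The temporary-file step and the meta key are illustrative assumptions, not part of either answer: each archive is decompressed and handed back to Scrapy through a file:// request so the usual parsing callback still runs. Alternatively, decompress everything once at the operating-system level (e.g. with gunzip) and point the file:// requests at the resulting .html files.

import glob
import gzip
import os
import tempfile

from scrapy import Request, Spider

class InfoExtractSpider(Spider):
    name = "info_extract"
    source_dir = "/path/to/files"  # assumption: the directory holding the .html.gz files

    def start_requests(self):
        for gz_path in glob.glob(os.path.join(self.source_dir, "*.html.gz")):
            # Decompress each archive into a temporary .html file so Scrapy's
            # built-in file:// download handler can serve it unmodified.
            with gzip.open(gz_path, "rb") as infile:
                html = infile.read()
            tmp = tempfile.NamedTemporaryFile(suffix=".html", delete=False)
            tmp.write(html)
            tmp.close()
            yield Request("file://" + tmp.name,
                          callback=self.parse_item,
                          meta={"source_file": gz_path})

    def parse_item(self, response):
        # The XPath extraction from the question goes here.
        pass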

MissingSchema(error) thrown when following tutorial

I am following "The Complete Python Course: Beginner to Advanced!" on Skillshare, and there is a point where my code breaks while the code in the tutorial continues just fine.
The tutorial is about making a web scraper with BeautifulSoup, Pillow, and io. I'm supposed to be able to search for anything on Bing, then save the pictures from the image search results to a folder on my computer.
Here's the Code:
from bs4 import BeautifulSoup
import requests
from PIL import Image
from io import BytesIO

search = input("Search for:")
params = {"q": search}
r = requests.get("http://bing.com/images/search", params=params)
soup = BeautifulSoup(r.text, "html.parser")
links = soup.findAll("a", {"class": "iusc"})

for item in links:
    img_obj = requests.get(item.attrs["href"])
    print("getting", item.attrs["href"])
    title = item.attrs["href"].split("/")[-1]
    img = Image.open(BytesIO(img_obj.content))
    img.save("C:\\Users\\user\\PycharmProjects\\webscrapery\\scraped_images" + title, img.format)
Whenever I run it, it eventually fails with a MissingSchema error:
requests.exceptions.MissingSchema: Invalid URL
I tried adding
img_obj = requests.get("https://" + item.attrs["href"])
but it keeps giving me the same error.
I have looked at the Bing page source, and the only change I have made is swapping the "thumb" class for "iusc". I tried using the "thumb" class as in the tutorial, but then the program just runs without saving anything and eventually finishes.
Thank you for your help
EDIT: Here is the whole error that is being thrown, as requested by baileythegreen:
Traceback (most recent call last):
File "C:\Users\user\PycharmProjects\webscrapery\images.py", line 14, in <module>
img_obj = requests.get(item.attrs["href"])
File "C:\Users\user\AppData\Local\Programs\Python\Python39\lib\site-packages\requests\api.py", line 75, in get
return request('get', url, params=params, **kwargs)
File "C:\Users\user\AppData\Local\Programs\Python\Python39\lib\site-packages\requests\api.py", line 61, in request
return session.request(method=method, url=url, **kwargs)
File "C:\Users\user\AppData\Local\Programs\Python\Python39\lib\site-packages\requests\sessions.py", line 515, in request
prep = self.prepare_request(req)
File "C:\Users\user\AppData\Local\Programs\Python\Python39\lib\site-packages\requests\sessions.py", line 443, in prepare_request
p.prepare(
File "C:\Users\user\AppData\Local\Programs\Python\Python39\lib\site-packages\requests\models.py", line 318, in prepare
self.prepare_url(url, params)
File "C:\Users\user\AppData\Local\Programs\Python\Python39\lib\site-packages\requests\models.py", line 392, in prepare_url
raise MissingSchema(error)
requests.exceptions.MissingSchema: Invalid URL '/images/search?view=detailV2&ccid=mhMFjL9x&id=AE886A498BB66C1DCDCC08B6B45163C71DBF18CB&thid=OIP.mhMFjL9xzdgqujACTRW4zAHaNL&mediaurl=https%3a%2f%2fimage.zmenu.com%2fmenupic%2f2349041%2fs_6565a805-53ac-4f35-a2cb-a3f79c3eab4b.jpg&cdnurl=https%3a%2f%2fth.bing.com%2fth%2fid%2fR.9a13058cbf71cdd82aba30024d15b8cc%3frik%3dyxi%252fHcdjUbS2CA%26pid%3dImgRaw%26r%3d0&exph=1000&expw=562&q=pizza&simid=607993487650659823&FORM=IRPRST&ck=B86DF0449AD7ABD39A1B1697EA9E6D16&selectedIndex=0': No scheme supplied. Perhaps you meant http:///images/search?view=detailV2&ccid=mhMFjL9x&id=AE886A498BB66C1DCDCC08B6B45163C71DBF18CB&thid=OIP.mhMFjL9xzdgqujACTRW4zAHaNL&mediaurl=https%3a%2f%2fimage.zmenu.com%2fmenupic%2f2349041%2fs_6565a805-53ac-4f35-a2cb-a3f79c3eab4b.jpg&cdnurl=https%3a%2f%2fth.bing.com%2fth%2fid%2fR.9a13058cbf71cdd82aba30024d15b8cc%3frik%3dyxi%252fHcdjUbS2CA%26pid%3dImgRaw%26r%3d0&exph=1000&expw=562&q=pizza&simid=607993487650659823&FORM=IRPRST&ck=B86DF0449AD7ABD39A1B1697EA9E6D16&selectedIndex=0?
Edit 2: I followed hawschiat's instructions, and I am getting a different error this time:
Traceback (most recent call last):
File "C:\Users\user\PycharmProjects\webscrapery\images.py", line 15, in <module>
print("getting", item.attrs["href"])
KeyError: 'href'
However, if I keep the "src" attribute in the print line, I get
getting http://tse2.mm.bing.net/th/id/OIP.mhMFjL9xzdgqujACTRW4zAHaNL?w=187&h=333&c=7&r=0&o=5&pid=1.7
Traceback (most recent call last):
File "C:\Users\user\PycharmProjects\webscrapery\images.py", line 18, in <module>
img.save(r'C://Users/user/PycharmProjects/webscrapery/scraped_images' + title, img.format)
File "C:\Users\user\AppData\Local\Programs\Python\Python39\lib\site-packages\PIL\Image.py", line 2209, in save
fp = builtins.open(filename, "w+b")
OSError: [Errno 22] Invalid argument: 'C://Users/user/PycharmProjects/webscrapery/scraped_imageshttp://tse2.mm.bing.net/th/id/OIP.mhMFjL9xzdgqujACTRW4zAHaNL?w=187&h=333&c=7&r=0&o=5&pid=1.7'
I tried using an 'r' prefix in front of the C: path, but it keeps giving me the same error. I also tried changing the forward slashes to backslashes, and putting two slashes in front of the C. I also made sure I have write permission on the scraped_images folder, which I do, as well as on webscrapery.
The last line of your stack trace gives you a hint of the cause of the error. The URL scraped from the webpage is not a full URL, but rather the path to the resource.
To make it a full URL, you can simply prepend it with the scheme and authority. In your case, that would be https://bing.com.
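For instance, a minimal illustration of that prepending, keeping your original href-based loop:

# Prepend the scheme and authority so requests receives an absolute URL.
img_obj = requests.get("https://bing.com" + item.attrs["href"])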
That being said, I don't think the URL you obtained is actually the URL of the image. Inspecting Bing Images with Chrome's developer tools shows that each search result is an anchor (a) element with the iusc class that points to the preview page, while its child img element contains the actual path to the image resource.
With that in mind, we can rewrite your code to something like:
links = soup.findAll("img", {"class": "mimg"})
for item in links:
    img_obj = requests.get(item.attrs["src"])
    print("getting", item.attrs["src"])
    title = item.attrs["src"].split("/")[-1]
    img = Image.open(BytesIO(img_obj.content))
    img.save("C:\\Users\\user\\PycharmProjects\\webscrapery\\scraped_images\\" + title, img.format)
And this should achieve what you are trying to do.
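For completeness, here is a minimal end-to-end sketch along those lines. The save directory is taken from the question; stripping the query string from the file name (the cause of the OSError in Edit 2, since Windows file names cannot contain "?") is an extra assumption rather than part of the answer above:

from io import BytesIO

import requests
from bs4 import BeautifulSoup
from PIL import Image

SAVE_DIR = "C:\\Users\\user\\PycharmProjects\\webscrapery\\scraped_images\\"

search = input("Search for:")
r = requests.get("https://www.bing.com/images/search", params={"q": search})
soup = BeautifulSoup(r.text, "html.parser")

for item in soup.find_all("img", {"class": "mimg"}):
    src = item.attrs.get("src")
    if not src:
        continue  # skip tiles without a src attribute (e.g. lazy-loaded ones)
    print("getting", src)
    img_obj = requests.get(src)
    # Drop the query string so the file name is valid on Windows.
    title = src.split("/")[-1].split("?")[0]
    img = Image.open(BytesIO(img_obj.content))
    img.save(SAVE_DIR + title, img.format)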

How Can I Fix "TypeError: Cannot mix str and non-str arguments"?

I'm writing some scraping code and I'm running into the error above.
My code is as follows.
# -*- coding: utf-8 -*-
import scrapy
from myproject.items import Headline

class NewsSpider(scrapy.Spider):
    name = 'IC'
    allowed_domains = ['kosoku.jp']
    start_urls = ['http://kosoku.jp/ic.php']

    def parse(self, response):
        """
        extract target urls and combine them with the main domain
        """
        for url in response.css('table a::attr("href")'):
            yield(scrapy.Request(response.urljoin(url), self.parse_topics))

    def parse_topics(self, response):
        """
        pick up necessary information
        """
        item=Headline()
        item["name"]=response.css("h2#page-name ::text").re(r'.*(インターチェンジ)')
        item["road"]=response.css("div.ic-basic-info-left div:last-of-type ::text").re(r'.*道$')
        yield item
I can get the correct result when I run each step individually in a shell session, but once it is put into the program and run, it fails with the traceback below.
2017-11-27 18:26:17 [scrapy.core.scraper] ERROR: Spider error processing <GET http://kosoku.jp/ic.php> (referer: None)
Traceback (most recent call last):
File "/Users/sonogi/envs/scrapy/lib/python3.5/site-packages/scrapy/utils/defer.py", line 102, in iter_errback
yield next(it)
File "/Users/sonogi/envs/scrapy/lib/python3.5/site-packages/scrapy/spidermiddlewares/offsite.py", line 29, in process_spider_output
for x in result:
File "/Users/sonogi/envs/scrapy/lib/python3.5/site-packages/scrapy/spidermiddlewares/referer.py", line 339, in <genexpr>
return (_set_referer(r) for r in result or ())
File "/Users/sonogi/envs/scrapy/lib/python3.5/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
return (r for r in result or () if _filter(r))
File "/Users/sonogi/envs/scrapy/lib/python3.5/site-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
return (r for r in result or () if _filter(r))
File "/Users/sonogi/scraping/myproject/myproject/spiders/IC.py", line 16, in parse
yield(scrapy.Request(response.urljoin(url), self.parse_topics))
File "/Users/sonogi/envs/scrapy/lib/python3.5/site-packages/scrapy/http/response/text.py", line 82, in urljoin
return urljoin(get_base_url(self), url)
File "/opt/local/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/urllib/parse.py", line 424, in urljoin
base, url, _coerce_result = _coerce_args(base, url)
File "/opt/local/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/urllib/parse.py", line 120, in _coerce_args
raise TypeError("Cannot mix str and non-str arguments")
TypeError: Cannot mix str and non-str arguments
2017-11-27 18:26:17 [scrapy.core.engine] INFO: Closing spider (finished)
I'm so confused and appreciate anyone's help upfront!
According to the Scrapy documentation, the .css(selector) method that you're using returns a SelectorList instance. If you want the actual (unicode) string version of the URL, call the extract() method:
def parse(self, response):
    for url in response.css('table a::attr("href")').extract():
        yield(scrapy.Request(response.urljoin(url), self.parse_topics))
You're getting this error because of the code at line 15.
response.css('table a::attr("href")') returns a list-like object, so you first have to convert each url from that list to a str before you can pass it on to another function.
Furthermore, the attr syntax may lead to an error, because the correct attr expression doesn't use quotes: instead of a::attr("href") it should be a::attr(href).
After fixing those two issues, the code will look something like this:
def parse(self, response):
    """
    extract target urls and combine them with the main domain
    """
    url = response.css('table a::attr(href)')
    url_str = ''.join(map(str, url))  # converts list to str
    yield response.follow(url_str, self.parse_topics)

Problems writing Scraped data to csv with Slavic characters (UnicodeEncodeError & TypeError)

Intention / Wanted result:
To scrape the link titles (i.e. the text of the link for each item) from a Czech website:
https://www.bezrealitky.cz/vypis/nabidka-prodej/byt/praha
And write the results to a CSV file, preferably as a list, so that I can later manipulate the data in another Python data-analysis model.
Result / Problem:
I am getting a UnicodeEncodeError and a TypeError. I suspect this has to do with the accented characters used in Czech. Please see the tracebacks below.
Traceback:
TypeError Traceback:
2017-01-19 08:00:18 [scrapy] ERROR: Error processing {'title': b'\n Ob\xc4\x9bt\xc3\xad 6. kv\xc4\x9b'
b'tna, Praha - Kr\xc4\x8d '}
Traceback (most recent call last):
File "C:\Users\phili\Anaconda3\envs\py35\lib\site-packages\twisted\internet\defer.py", line 651, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "C:\Users\phili\Documents\Python Scripts\Scrapy Spiders\bezrealitky\bezrealitky\pipelines.py", line 24, in process_item
self.exporter.export_item(item)
File "C:\Users\phili\Anaconda3\envs\py35\lib\site-packages\scrapy\exporters.py", line 193, in export_item
self._write_headers_and_set_fields_to_export(item)
File "C:\Users\phili\Anaconda3\envs\py35\lib\site-packages\scrapy\exporters.py", line 217, in _write_headers_and_set_fields_to_export
self.csv_writer.writerow(row)
File "C:\Users\phili\Anaconda3\envs\py35\lib\codecs.py", line 718, in write
return self.writer.write(data)
File "C:\Users\phili\Anaconda3\envs\py35\lib\codecs.py", line 376, in write
data, consumed = self.encode(object, self.errors)
TypeError: Can't convert 'bytes' object to str implicitly
UnicodeEncodeError Traceback:
2017-01-19 08:00:18 [scrapy] ERROR: Error processing {'title': b'\n Ob\xc4\x9bt\xc3\xad 6. kv\xc4\x9b'
b'tna, Praha - Kr\xc4\x8d '}
Traceback (most recent call last):
File "C:\Users\phili\Anaconda3\envs\py35\lib\site-packages\twisted\internet\defer.py", line 651, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "C:\Users\phili\Documents\Python Scripts\Scrapy Spiders\bezrealitky\bezrealitky\pipelines.py", line 24, in process_item
self.exporter.export_item(item)
File "C:\Users\phili\Anaconda3\envs\py35\lib\site-packages\scrapy\exporters.py", line 198, in export_item
self.csv_writer.writerow(values)
File "C:\Users\phili\Anaconda3\envs\py35\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u011b' in position 37: character maps to <undefined>
Situation / Process:
I am running scrapy crawl bezrealitky (bezrealitky is the spider's name). I have configured the pipeline with a CsvItemExporter I found on the internet and tried to adapt it so that the file is opened with UTF-8 encoding (I also tried without UTF-8 at first, but got the same error).
My pipeline code:
from scrapy.conf import settings
from scrapy.exporters import CsvItemExporter
import codecs

class CsvPipeline(object):
    def __init__(self):
        self.file = codecs.open("booksdata.csv", 'wb', encoding='UTF-8')
        self.exporter = CsvItemExporter(self.file)
        self.exporter.start_exporting()

    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item
My settings file:
BOT_NAME = 'bezrealitky'

SPIDER_MODULES = ['bezrealitky.spiders']
NEWSPIDER_MODULE = 'bezrealitky.spiders'

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'bezrealitky.pipelines.CsvPipeline': 300,
}
My spider code:
class BezrealitkySpider(scrapy.Spider):
    name = 'bezrealitky'
    start_urls = [
        'https://www.bezrealitky.cz/vypis/nabidka-prodej/byt/praha'
    ]

    def parse(self, response):
        item = BezrealitkyItem()
        items = []
        for records in response.xpath('//*[starts-with(@class,"record")]'):
            item['title'] = response.xpath('.//div[@class="details"]/h2/a[@href]/text()').extract()[1].encode('utf-8')
            items.append(item)
        return(items)
Solutions tried so far:
To add and remove .encode('utf-8') from the extract() call, and also in pipeline.py, but it didn't work.
Also tried adding # -*- coding: utf-8 -*- to the beginning of the file; that didn't work either.
I tried to switch the console to UTF-8 with this:
chcp 65001
set PYTHONIOENCODING=utf-8
Conclusion:
I cannot get the scraped data written to the CSV file; the CSV is created but there is nothing in it, even though in the shell I can see that data is scraped. It isn't decoded/encoded properly, and an error is thrown before it is written to the file.
I am a complete beginner with this, just trying to pick up Scrapy. I would really appreciate any help I can get!
What I use in order to scrape Czech websites and avoid these errors is the unidecode module. What this module does is an ASCII transliteration of Unicode text.
# -*- coding: utf-8 -*-
from unidecode import unidecode

class BezrealitkySpider(scrapy.Spider):
    name = 'bezrealitky'
    start_urls = [
        'https://www.bezrealitky.cz/vypis/nabidka-prodej/byt/praha'
    ]

    def parse(self, response):
        item = BezrealitkyItem()
        items = []
        for records in response.xpath('//*[starts-with(@class,"record")]'):
            item['title'] = unidecode(response.xpath('.//div[@class="details"]/h2/a[@href]/text()').extract()[1].encode('utf-8'))
            items.append(item)
        return(items)
Since I use an ItemLoader, my code looks more like this:
# -*- coding: utf-8 -*-
from unidecode import unidecode
from scrapy.loader import ItemLoader
from scrapy.loader.processors import MapCompose

class BaseItemLoader(ItemLoader):
    title_in = MapCompose(unidecode)
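For reference, a quick demonstration of the transliteration on the title from the error message (a minimal sketch; it assumes the unidecode package is installed, e.g. via pip install unidecode):

# -*- coding: utf-8 -*-
from unidecode import unidecode

# unidecode maps accented characters to their closest ASCII equivalents.
print(unidecode(u'Obětí 6. května, Praha - Krč'))
# -> Obeti 6. kvetna, Praha - Krc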

File Create/Write Issue In Python

I'm trying to create and write to a file. I have the following code:
from urllib2 import urlopen

def crawler(seed_url):
    to_crawl = [seed_url]
    crawled = []
    while to_crawl:
        page = to_crawl.pop()
        page_source = urlopen(page)
        s = page_source.read()
        with open(str(page)+".txt", "a+") as f:
            f.write(s)
            f.close()
    return crawled

if __name__ == "__main__":
    crawler('http://www.yelp.com/')
However, it returns the error:
Traceback (most recent call last):
File "/Users/adamg/PycharmProjects/NLP-HW1/scrape-test.py", line 29, in <module>
crawler('http://www.yelp.com/')
File "/Users/adamg/PycharmProjects/NLP-HW1/scrape-test.py", line 14, in crawler
with open("./"+str(page)+".txt","a+") as f:
IOError: [Errno 2] No such file or directory: 'http://www.yelp.com/.txt'
I thought that open(file,"a+") is supposed to create and write. What am I doing wrong?
If you want to use the URL as the basis for the file name, you should encode the URL first. That way, slashes (among other characters) are converted to character sequences that won't interfere with the file system/shell.
The urllib library can help with this.
So, for example:
>>> import urllib
>>> urllib.quote_plus('http://www.yelp.com/')
'http%3A%2F%2Fwww.yelp.com%2F'
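Applied to the crawler from the question, a minimal sketch might look like this (Python 2, to match the original code; only the file-name handling changes):

from urllib import quote_plus
from urllib2 import urlopen

def crawler(seed_url):
    to_crawl = [seed_url]
    crawled = []
    while to_crawl:
        page = to_crawl.pop()
        s = urlopen(page).read()
        # quote_plus() turns the slashes in the URL into %2F, so the result
        # is a single valid file name instead of a path into missing folders.
        with open(quote_plus(page) + ".txt", "a+") as f:
            f.write(s)
    return crawled

if __name__ == "__main__":
    crawler('http://www.yelp.com/')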

Scrapy throwing up Traceback when trying to parse tabulated data

I am running Scrapy on Python 2.7 (64-bit) on Windows Vista 64-bit. I have some Scrapy code that is trying to parse data contained within a table at the URL given in the following code:
from scrapy.spider import Spider
from scrapy.selector import Selector
from scrapy.utils.markup import remove_tags
from scrapy.cmdline import execute
import re

class MySpider(Spider):
    name = "wiki"
    allowed_domains = ["whoscored.com"]
    start_urls = ["http://www.whoscored.com/Players/3859/Fixtures/Wayne-Rooney"]

def parse(self, response):
    for row in response.selector.xpath('//table[@id="player-fixture"]//tr[td[@class="tournament"]]'):
        # Does this row contain goal symbols?
        list_of_goals = row.xpath('//span[@title="Goal"')
        if list_of_goals:
            print remove_tags(list_of_goals).encode('utf-8')

execute(['scrapy','crawl','wiki'])
However, it is throwing up the following error:
Traceback (most recent call last):
File "c:\Python27\lib\site-packages\twisted\internet\base.py", line 1201, in mainLoop
self.runUntilCurrent()
File "c:\Python27\lib\site-packages\twisted\internet\base.py", line 824, in runUntilCurrent
call.func(*call.args, **call.kw)
File "c:\Python27\lib\site-packages\twisted\internet\defer.py", line 383, in callback
self._startRunCallbacks(result)
File "c:\Python27\lib\site-packages\twisted\internet\defer.py", line 491, in _startRunCallbacks
self._runCallbacks()
--- <exception caught here> ---
File "c:\Python27\lib\site-packages\twisted\internet\defer.py", line 578, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "c:\Python27\lib\site-packages\scrapy\spider.py", line 56, in parse
raise NotImplementedError
exceptions.NotImplementedError:
Can anyone tell me what the issue is here? I am trying to print all items in the table, including the data in the goals and assists columns.
Thanks
Your indentation is wrong: the parse method has to be indented so that it is part of the class:
class MySpider(Spider):
    name = "wiki"
    allowed_domains = ["whoscored.com"]
    start_urls = ["http://www.whoscored.com/Players/3859/Fixtures/Wayne-Rooney"]

    def parse(self, response):
        for row in response.selector.xpath('//table[@id="player-fixture"]//tr[td[@class="tournament"]]'):
            # Does this row contain goal symbols?
            list_of_goals = row.xpath('//span[@title="Goal"')
            if list_of_goals:
                print remove_tags(list_of_goals).encode('utf-8')
Implementing a parse method is a requirement when you use the Spider class; this is what the method looks like in the Scrapy source code:

def parse(self, response):
    raise NotImplementedError
Your indentation was wrong, so parse was not part of the class, and therefore you had not implemented the required method.
The raise NotImplementedError is there to ensure you write the required parse method when inheriting from the Spider base class.
You now just have to find the correct xpath ;)
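As a pointer for that last step: the XPath inside the loop is missing its closing bracket, and as written it searches the whole document rather than the current row. A hedged sketch of what the loop body might look like (the selectors come from the question; the .// scoping and the per-goal loop are assumptions):

for row in response.xpath('//table[@id="player-fixture"]//tr[td[@class="tournament"]]'):
    # ".//" keeps the search inside the current row; the closing "]" is required.
    goals = row.xpath('.//span[@title="Goal"]').extract()
    for goal in goals:
        print remove_tags(goal).encode('utf-8')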
