I need to save a file (.pdf) but I'm unsure how to do it. I need to save .pdfs and store them in such a way that they are organized in directories much like they are stored on the site I'm scraping them from.
From what I can gather I need to make a pipeline, but from what I understand pipelines save "Items", and "items" are just basic data like strings/numbers. Is saving files a proper use of pipelines, or should I save the file in the spider instead?
Yes and no[1]. If you fetch a pdf it will be stored in memory, but as long as the pdfs are not big enough to fill up your available memory, that is fine.
You could save the pdf in the spider callback:
def parse_listing(self, response):
    # ... extract pdf urls
    for url in pdf_urls:
        yield Request(url, callback=self.save_pdf)

def save_pdf(self, response):
    path = self.get_path(response.url)
    with open(path, "wb") as f:
        f.write(response.body)
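get_path is not shown in the answer; a minimal sketch of what such a helper could look like, assuming the files should mirror the site's URL structure under a local downloads/ directory (both the method and the base directory are assumptions):

import os
from urllib.parse import urlparse

def get_path(self, url):
    # hypothetical helper: mirror the URL path under a local base directory
    parsed = urlparse(url)
    local_path = os.path.join("downloads", parsed.netloc, parsed.path.lstrip("/"))
    os.makedirs(os.path.dirname(local_path), exist_ok=True)  # open() won't create the directories
    return local_path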
If you choose to do it in a pipeline:
# in the spider
def parse_pdf(self, response):
    i = MyItem()
    i['body'] = response.body
    i['url'] = response.url
    # you can add more metadata to the item
    return i

# in your pipeline
def process_item(self, item, spider):
    path = self.get_path(item['url'])
    with open(path, "wb") as f:
        f.write(item['body'])
    # remove body and add path as reference
    del item['body']
    item['path'] = path
    # let item be processed by other pipelines, e.g. a db store
    return item
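For the pipeline route to work, the pipeline class also has to be enabled in the project settings; a minimal sketch, where the module path and class name are placeholders:

# settings.py
ITEM_PIPELINES = {
    "myproject.pipelines.SavePdfPipeline": 300,  # the pipeline with process_item above
}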
[1] Another approach could be to store only the pdfs' urls and use another process to fetch the documents without buffering them into memory (e.g. wget).
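A minimal sketch of that variant, assuming the spider only writes the pdf urls to a text file (urls.txt and the pdfs/ target directory are assumptions) and wget fetches them afterwards:

import subprocess

# urls.txt is written by the spider: one pdf url per line
subprocess.run(
    ["wget", "--input-file=urls.txt", "--directory-prefix=pdfs/"],
    check=True,  # raise if wget reports a failure
)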
There is a FilesPipeline that you can use directly, assuming you already have the file url. This link shows how to use the FilesPipeline:
https://groups.google.com/forum/print/msg/scrapy-users/kzGHFjXywuY/O6PIhoT3thsJ
It's a perfect tool for the job. The way Scrapy works is that you have spiders that transform web pages into structured data (items). Pipelines are postprocessors, but they use the same asynchronous infrastructure as spiders, so they are well suited for fetching media files.
In your case, you'd first extract the locations of the PDFs in the spider, fetch them in a pipeline, and have another pipeline to save the items.
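A rough sketch of that split, assuming the built-in FilesPipeline does the fetching and a separate, hypothetical pipeline stores the item afterwards; the names other than file_urls, files and FILES_STORE are placeholders:

import scrapy

class PdfItem(scrapy.Item):
    file_urls = scrapy.Field()  # filled in by the spider, consumed by FilesPipeline
    files = scrapy.Field()      # filled in by FilesPipeline with the download results

# settings.py (sketch)
# ITEM_PIPELINES = {
#     "scrapy.pipelines.files.FilesPipeline": 1,    # fetches the PDFs
#     "myproject.pipelines.DbStorePipeline": 300,   # hypothetical: stores the item
# }
# FILES_STORE = "downloaded_pdfs"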
Related
I would like to download all PDFs found on a site, e.g. https://www.stadt-koeln.de/politik-und-verwaltung/bekanntmachungen/amtsblatt/index.html. I also tried to use rules, but I think they're not necessary here.
This is my approach:
import scrapy
from scrapy.linkextractors import IGNORED_EXTENSIONS

CUSTOM_IGNORED_EXTENSIONS = IGNORED_EXTENSIONS.copy()
CUSTOM_IGNORED_EXTENSIONS.remove('pdf')

class PDFParser(scrapy.Spider):
    name = 'stadt_koeln_amtsblatt'

    # URL of the pdf file
    start_urls = ['https://www.stadt-koeln.de/politik-und-verwaltung/bekanntmachungen/amtsblatt/index.html']

    rules = (
        Rule(LinkExtractor(allow=r'.*\.pdf', deny_extensions=CUSTOM_IGNORED_EXTENSIONS), callback='parse', follow=True),
    )

    def parse(self, response):
        # selector of pdf file.
        for pdf in response.xpath("//a[contains(@href, 'pdf')]"):
            yield scrapy.Request(
                url=response.urljoin(pdf),
                callback=self.save_pdf
            )

    def save_pdf(self, response):
        path = response.url.split('/')[-1]
        self.logger.info('Saving PDF %s', path)
        with open(path, 'wb') as f:
            f.write(response.body)
It seems there are two problems. The first one occurs when extracting all the pdf links with XPath:
TypeError: Cannot mix str and non-str arguments
The second problem is about handling the pdf file itself. I just want to store it locally in a specific folder or similar. It would be really great if someone had a working example for this kind of site.
To download files you need to use the FilesPipeline. This requires that you enable it in ITEM_PIPELINES and then provide a field named file_urls in your yielded item. In the example below, I have created an extension of the FilesPipeline in order to retain the filename of the pdf as provided on the website. The files will be saved in a folder named downloaded_files in the current directory.
Read more about the FilesPipeline in the docs.
import scrapy
from scrapy.pipelines.files import FilesPipeline

class PdfPipeline(FilesPipeline):
    # to save with the name of the pdf from the website instead of hash
    def file_path(self, request, response=None, info=None):
        file_name = request.url.split('/')[-1]
        return file_name

class StadtKoelnAmtsblattSpider(scrapy.Spider):
    name = 'stadt_koeln_amtsblatt'
    start_urls = ['https://www.stadt-koeln.de/politik-und-verwaltung/bekanntmachungen/amtsblatt/index.html']
    custom_settings = {
        "ITEM_PIPELINES": {
            PdfPipeline: 100
        },
        "FILES_STORE": "downloaded_files"
    }

    def parse(self, response):
        links = response.xpath("//a[@class='download pdf pdf']/@href").getall()
        links = [response.urljoin(link) for link in links]  # to make them absolute urls
        yield {
            "file_urls": links
        }
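Assuming a reasonably recent Scrapy version, a quick way to try the example above is to run it as a stand-alone script with CrawlerProcess; the PDFs then end up in downloaded_files/ next to the script:

from scrapy.crawler import CrawlerProcess

process = CrawlerProcess()
process.crawl(StadtKoelnAmtsblattSpider)
process.start()  # blocks until the crawl finishes; files are named by PdfPipeline.file_path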
I am trying to develop a Python script for my data engineering project. I want to loop over 47 URLs stored in a dataframe, download a CSV file from each, and store it on my local machine. Below is an example with one of the URLs:
import requests

test_url = "https://data.cdc.gov/api/views/pj7m-y5uh/rows.csv?accessType=DOWNLOAD"

req = requests.get(test_url)
url_content = req.content
csv_file = open('cdc6.csv', 'wb')
csv_file.write(url_content)
csv_file.close()
I have this for a single file, but instead of opening a CSV file and writing the data into it one at a time, I want to download all the files directly and save them on my local machine.
You want to iterate and then download each file to a folder. Iteration is easy using the .items() method on pandas objects and passing it into a loop. See the documentation here.
Then, you want to download each item. urllib has a urlretrieve(url, filename) function for downloading a hosted file to a local file (in Python 3 it lives under urllib.request), which is elaborated on in the urllib documentation here.
Your code may look like:

import urllib.request

for index, url in url_df.items():
    urllib.request.urlretrieve(url, "cdcData" + str(index) + ".csv")

or, if you want to preserve the original names:

for index, url in url_df.items():
    name = url.split("/")[-1]
    urllib.request.urlretrieve(url, name)
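Since the question already uses requests, an equivalent sketch with requests and pathlib; the cdc_data output folder is an assumption, and url_df is the Series of URLs from the answer above:

from pathlib import Path
import requests

out_dir = Path("cdc_data")
out_dir.mkdir(exist_ok=True)

for index, url in url_df.items():
    response = requests.get(url)
    response.raise_for_status()  # surface failed downloads immediately
    out_path = out_dir / "cdcData{}.csv".format(index)  # one file per row, keyed by the index
    out_path.write_bytes(response.content)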
I am trying to get data from multiple start URLs using the same Scrapy spider file. My goal is to create multiple URLs by changing a particular ID in a web address and run the spider over the sequence of IDs. All the IDs are saved in a CSV file. The formal name of my ID is CIK. For simplicity, I put two CIKs here (in the original file, I have around 19,000 CIKs):
1326801
320193
So the manually created websites should look like this:
https://www.secform4.com/insider-trading/1326801-0.htm
https://www.secform4.com/insider-trading/320193-0.htm
My question is: how can I import the CIKs saved in the CSV file, have the Scrapy spider build the start URLs from them, and run the created URLs sequentially?
Also, some of these CIKs do not have data on that website. How can I tell the spider to ignore the manually created URLs that turn out to be unavailable?
I am just a beginner. If possible, please suggest specific changes to my code (specific code would be highly appreciated). Thank you in advance.
import scrapy

class InsiderSpider(scrapy.Spider):
    name = 'insider'
    cik = 320193
    allowed_domains = ['www.secform4.com']
    start_urls = ['https://www.secform4.com/insider-trading/' + str(cik) + '-0.htm']
It's possible to write all the urls into start_urls, but this is not best practice.
Use
from scrapy import Spider, Request

class MySpider(Spider):
    name = 'csv'

    def start_requests(self):
        with open('file.csv') as f:
            for line in f:
                if not line.strip():
                    continue
                yield Request(line)
shown in:
How to loop through multiple URLs to scrape from a CSV file in Scrapy?
instead.
df = '1326801', '320193'
urls = ['https://www.secform4.com/insider-trading/' + str(i) +'-0.htm' for i in df]
print(urls)
['https://www.secform4.com/insider-trading/1326801-0.htm', 'https://www.secform4.com/insider-trading/320193-0.htm']
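Putting the pieces together, and covering the part of the question about CIKs with no data: a sketch that reads the CIKs from a CSV file, builds the URLs in start_requests, and uses an errback so unavailable pages are logged and skipped (the ciks.csv filename and the errback name are assumptions):

import csv
import scrapy

class InsiderSpider(scrapy.Spider):
    name = 'insider'
    allowed_domains = ['www.secform4.com']

    def start_requests(self):
        with open('ciks.csv') as f:
            for row in csv.reader(f):
                if not row:
                    continue
                cik = row[0].strip()
                url = f'https://www.secform4.com/insider-trading/{cik}-0.htm'
                yield scrapy.Request(url, callback=self.parse, errback=self.skip_missing)

    def parse(self, response):
        # ... extract the insider trading data here
        pass

    def skip_missing(self, failure):
        # 404s and other errors end up here; just log them and move on
        self.logger.info('Skipping unavailable page: %s', repr(failure))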
I have been stuck for hours on how to write my crawled data into multiple files. I wrote code that scrapes a website and extracts the body of each link on it. An example is crawling a news website: you extract all the links and then extract the body of each link. I have done that successfully, but my concern now is that, instead of storing everything in one file using the code below,
def save_data(data):
    the_file = open('raw_data.txt', 'w')
    for title_text, body_content, url in data:
        the_file.write("%s\n" % [title_text, body_content, url])
how do I write the code so that each article is stored in a different file? So I would have something like Article_00, Article_01, Article_02, and so on.
Thanks
If you want to save the data in multiple files, then you must open multiple files for writing.
Use enumerate to get a counter for which data set you're iterating over, so you can use it in the filename like this:
def save_data(data):
    for i, (title_text, body_content, url) in enumerate(data):
        file = open('Article_%02d' % (i,), 'w+')
        file.write("%s\n" % [title_text, body_content, url])
        file.close()
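If the articles should also get a file extension and land in their own folder, a small variant sketch (the articles/ folder and the .txt extension are assumptions):

from pathlib import Path

def save_data(data):
    out_dir = Path("articles")
    out_dir.mkdir(exist_ok=True)
    for i, (title_text, body_content, url) in enumerate(data):
        # one file per article, zero-padded so the files sort naturally
        out_path = out_dir / "Article_{:02d}.txt".format(i)
        out_path.write_text("{}\n{}\n\n{}\n".format(title_text, url, body_content))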
I am new to Python and have a problem using Scrapy. I need to download some PDF files from URLs (the URLs point to PDFs, but there is no .pdf in them) and store them in a directory.
So far I have populated my item with the title (as you can see, I have passed the title as metadata of my previous request) and the body (which I get from the response body of my last request).
When my code reaches the with open call, though, I always get an error back from the terminal like this:
exceptions.IOError: [Errno 2] No such file or directory:
Here is my code:
def parse_objects(self, response):
    ....
    item = Item()
    item['title'] = titles.xpath('text()').extract()
    item['url'] = titles.xpath('a[@class="title"]/@href').extract()
    request = Request(item['url'][0], callback=self.parse_urls)
    request.meta['item'] = item
    yield request

def parse_urls(self, response):
    item = response.meta['item']
    item['desc'] = response.body
    with open(item['title'][1], "w") as f:
        f.write(response.body)
I am using item['title'][1] because the title field is a list and I need to save the PDF file using the second element, which is the name. As far as I know, when I use with open and there is no such file, Python creates the file automatically.
I'm using Python 3.4.
Can anyone help?
First you have to find the XPath of the URLs that you need to download.
Then save those links into one list.
Import the Python module urllib (import urllib).
Use urllib.urlretrieve to download the PDF files.
For example:
import urllib

url = []
url.extend(hxs.select('//a[@class="df"]/@href').extract())
for i in range(len(url)):
    urllib.urlretrieve(url[i], filename='%s' % i)
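On the original IOError: open() creates the file but not any missing directories, and a title that contains a slash is treated as a path, so the write fails. A minimal sketch of a safer parse_urls, assuming the PDFs should land in a pdfs/ folder (the folder name and the sanitizing rule are assumptions):

import os
import re

def parse_urls(self, response):
    item = response.meta['item']
    item['desc'] = response.body
    # replace path separators and other awkward characters in the title
    safe_name = re.sub(r'[^\w.-]+', '_', item['title'][1]) + '.pdf'
    os.makedirs('pdfs', exist_ok=True)  # open() does not create missing directories
    with open(os.path.join('pdfs', safe_name), 'wb') as f:  # binary mode for the PDF bytes
        f.write(response.body)
    return item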