I'm trying to create and write to a file. I have the following code:
from urllib2 import urlopen

def crawler(seed_url):
    to_crawl = [seed_url]
    crawled = []
    while to_crawl:
        page = to_crawl.pop()
        page_source = urlopen(page)
        s = page_source.read()
        with open(str(page)+".txt", "a+") as f:
            f.write(s)
            f.close()
    return crawled

if __name__ == "__main__":
    crawler('http://www.yelp.com/')
However, it returns the error:
Traceback (most recent call last):
File "/Users/adamg/PycharmProjects/NLP-HW1/scrape-test.py", line 29, in <module>
crawler('http://www.yelp.com/')
File "/Users/adamg/PycharmProjects/NLP-HW1/scrape-test.py", line 14, in crawler
with open("./"+str(page)+".txt","a+") as f:
IOError: [Errno 2] No such file or directory: 'http://www.yelp.com/.txt'
I thought that open(file,"a+") is supposed to create and write. What am I doing wrong?
If you want to use the URL as the basis for the filename, you should encode it first. That way, slashes (among other characters) will be converted to escape sequences that won't interfere with the file system or shell.
The urllib library can help with this.
So, for example:
>>> import urllib
>>> urllib.quote_plus('http://www.yelp.com/')
'http%3A%2F%2Fwww.yelp.com%2F'
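For example, a minimal rework of the question's crawler along those lines (Python 2, as in the question): the URL is percent-encoded before being used as a filename, and crawled pages are recorded so the return value is meaningful.

from urllib2 import urlopen
import urllib

def crawler(seed_url):
    to_crawl = [seed_url]
    crawled = []
    while to_crawl:
        page = to_crawl.pop()
        s = urlopen(page).read()
        # Encode the URL so slashes can't be read as directory separators
        filename = urllib.quote_plus(page) + ".txt"
        with open(filename, "a+") as f:
            f.write(s)  # the with block closes the file automatically
        crawled.append(page)
    return crawled

if __name__ == "__main__":
    crawler('http://www.yelp.com/')

This writes, e.g., http%3A%2F%2Fwww.yelp.com%2F.txt to the current directory.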
I'm using Python 3.6 on Windows 10. I want to download images whose URLs are stored, one per line, in a .txt file.
This is my code:
import requests
import shutil

file_image_url = open("test.txt", "r")
while True:
    image_url = file_image_url.readline()
    filename = image_url.split("/")[-1]
    r = requests.get(image_url, stream=True)
    r.raw.decode_content = True
    with open(filename, 'wb') as f:
        shutil.copyfileobj(r.raw, f)
but when I run the code above it gives me this error:
Traceback (most recent call last):
File "download_pictures.py", line 10, in <module>
with open(filename,'wb') as f:
OSError: [Errno 22] Invalid argument: '03.jpg\n'
test.txt contains:
https://mysite/images/03.jpg
https://mysite/images/26.jpg
https://mysite/images/34.jpg
When I put just a single URL in test.txt, it worked and downloaded the picture, but I need to download several images.
f.readline() reads a single line from the file; a newline character (\n) is left at the end of the string, and is only omitted on the last line of the file if the file doesn’t end in a newline.
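For example, reading the first line of the question's test.txt:

>>> f = open("test.txt")
>>> f.readline()
'https://mysite/images/03.jpg\n'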
You are passing this filename (with the \n) to the open function, hence the OSError. You need to call strip() on the filename before passing it to open.
Your filename has the newline character (\n) in it; remove it when you're parsing for the filename and that should fix your issue. It works when you have only one file path in the txt file because that single line has no trailing newline.
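A sketch of the corrected loop: iterating over the file object directly replaces the unbounded while True, and strip() removes the trailing newline before the line is used as a URL and filename.

import requests
import shutil

with open("test.txt", "r") as file_image_url:
    for line in file_image_url:
        image_url = line.strip()  # drop the trailing '\n'
        if not image_url:
            continue              # skip blank lines
        filename = image_url.split("/")[-1]
        r = requests.get(image_url, stream=True)
        r.raw.decode_content = True
        with open(filename, 'wb') as f:
            shutil.copyfileobj(r.raw, f)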
I'm having a hard time reading a pdf from the internet into the python PdfFileReader object.
My code works for the first url, but it doesn't for the second and I don't know how to fix it.
I can see that in the first example, the url refers to a .pdf file and in the second url the pdf is being returned as 'application data' in the html body.
So I think this might be the issue. Does anybody know how to fix it so the code also works for the second URL?
from pyPdf import PdfFileWriter, PdfFileReader
from io import BytesIO
import requests

def test(url, filename):
    response = requests.get(url)
    pdf_file = BytesIO(response.content)
    existing_pdf = PdfFileReader(pdf_file)
    page = existing_pdf.getPage(0)
    output = PdfFileWriter()
    output.addPage(page)
    outputStream = file(filename, "wb")
    output.write(outputStream)
    outputStream.close()

test('https://s21.q4cdn.com/374334112/files/doc_downloads/test.pdf', 'works.pdf')
test('https://eservices.minfin.fgov.be/mym-api-rest/finform/pdf/2057', 'crashes.pdf')
This is the stack trace I get from the second call of the test function:
D:\scripts>test.py
Traceback (most recent call last):
File "D:\scripts\test.py", line 21, in <module>
test('https://eservices.minfin.fgov.be/mym-api-rest/finform/pdf/2057','crashes.pdf')
File "D:\scripts\test.py", line 10, in test
page = existing_pdf.getPage(0)
File "C:\Python27\lib\site-packages\pyPdf\pdf.py", line 450, in getPage
self._flatten()
File "C:\Python27\lib\site-packages\pyPdf\pdf.py", line 596, in _flatten
catalog = self.trailer["/Root"].getObject()
File "C:\Python27\lib\site-packages\pyPdf\generic.py", line 480, in __getitem__
return dict.__getitem__(self, key).getObject()
File "C:\Python27\lib\site-packages\pyPdf\generic.py", line 165, in getObject
return self.pdf.getObject(self).getObject()
File "C:\Python27\lib\site-packages\pyPdf\pdf.py", line 655, in getObject
raise Exception, "file has not been decrypted"
Exception: file has not been decrypted
I found a solution: importing PyPDF2 instead of pyPdf makes the code work, so this was probably a bug in the older pyPdf.
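For reference, a sketch of the same function under PyPDF2 (whose 1.x releases keep the pyPdf names PdfFileReader/PdfFileWriter), with the Python 2-only file() builtin swapped for open():

from PyPDF2 import PdfFileWriter, PdfFileReader
from io import BytesIO
import requests

def test(url, filename):
    response = requests.get(url)
    existing_pdf = PdfFileReader(BytesIO(response.content))
    output = PdfFileWriter()
    output.addPage(existing_pdf.getPage(0))
    # open() works in both Python 2 and 3; file() is Python 2 only
    with open(filename, "wb") as output_stream:
        output.write(output_stream)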
I have a Scrapy spider that looks for static HTML files on disk, using a file:/// address as the start URL. I'm unable to load the gzip files and loop through my directory of 150,000 files, which all have the .html.gz suffix. I've tried several different approaches (commented out below), but nothing works so far. My code looks as follows:
from scrapy.spiders import CrawlSpider
from Scrapy_new.items import Scrapy_newTestItem
import gzip
import glob
import os.path

class Scrapy_newSpider(CrawlSpider):
    name = "info_extract"
    source_dir = '/path/to/file/'
    allowed_domains = []
    start_urls = ['file://///path/to/files/.*html.gz']

    def parse_item(self, response):
        item = Scrapy_newTestItem()
        item['user'] = response.xpath('//*[@id="page-user"]/div[1]/div/div/div[2]/div/div[2]/div[1]/h1/span[2]/text()').extract()
        item['list_of_links'] = response.xpath('//*[@id="page-user"]/div[1]/div/div/div[2]/div/div[3]/div[3]/a/@href').extract()
        item['list_of_text'] = response.xpath('//*[@id="page-user"]/div[1]/div/div/div/div/div/div/a/text()').extract()
Running this gives the following error:
Traceback (most recent call last):
File "/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 150, in maybeDeferred
result = f(*args, **kw)
File "/usr/local/lib/python2.7/site-packages/scrapy/core/downloader/handlers/file.py", line 13, in download_request
with open(filepath, 'rb') as fo:
IOError: [Errno 2] No such file or directory: 'path/to/files/*.html'
Changing my code so that the files are first unzipped and then passed through as follows:
source_dir = 'path/to/files/'
for src_name in glob.glob(os.path.join(source_dir, '*.gz')):
    base = os.path.basename(src_name)
    with gzip.open(src_name, 'rb') as infile:
        #start_urls = ['/path/to/files*.html']#
        file_cont = infile.read()
        start_urls = file_cont  #['file:////file_cont']
Gives the following error:
Traceback (most recent call last):
File "/usr/local/lib/python2.7/site-packages/scrapy/core/engine.py", line 127, in _next_request
request = next(slot.start_requests)
File "/usr/local/lib/python2.7/site-packages/scrapy/spiders/__init__.py", line 70, in start_requests
yield self.make_requests_from_url(url)
File "/usr/local/lib/python2.7/site-packages/scrapy/spiders/__init__.py", line 73, in make_requests_from_url
return Request(url, dont_filter=True)
File "/usr/local/lib/python2.7/site-packages/scrapy/http/request/__init__.py", line 25, in __init__
self._set_url(url)
File "/usr/local/lib/python2.7/site-packages/scrapy/http/request/__init__.py", line 57, in _set_url
raise ValueError('Missing scheme in request url: %s' % self._url)
ValueError: Missing scheme in request url: %3C
You don't always have to use start_urls in a Scrapy spider. Also, CrawlSpider is commonly used in conjunction with rules that specify routes to follow and what to extract when crawling big sites; you may want to use scrapy.Spider directly instead of CrawlSpider.
Now, the solution relies on the start_requests method that a Scrapy spider offers, which handles the spider's first requests. If this method is implemented in your spider, start_urls won't be used:
from scrapy import Spider
import gzip
import glob
import os

class ExampleSpider(Spider):
    name = 'info_extract'

    def start_requests(self):
        os.chdir("/path/to/files")
        for file_name in glob.glob("*.html.gz"):
            f = gzip.open(file_name, 'rb')
            file_content = f.read()
            print file_content  # now you are reading the file content of your local files
Now, remember that start_requests must return an iterable of requests, which isn't the case here, because we are only reading files (I assume you are going to create requests later with the content of those files), so my code will fail with something like:
CRITICAL:
Traceback (most recent call last):
...
/.../scrapy/crawler.py", line 73, in crawl
start_requests = iter(self.spider.start_requests())
TypeError: 'NoneType' object is not iterable
This points out that I am not returning anything from my start_requests method (None), which isn't iterable.
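To satisfy that contract, here is a hedged sketch of how the spider could be completed, assuming Python 2 (as in the tracebacks) and the question's directory layout: start_requests yields one file:// request per archive, and the callback decompresses the body before running any XPath. The paths and field names are placeholders.

from scrapy import Spider, Request
from scrapy.selector import Selector
from StringIO import StringIO
import glob
import gzip
import os

class ExampleSpider(Spider):
    name = 'info_extract'

    def start_requests(self):
        # Yield Requests so start_requests returns an iterable of requests
        for path in glob.glob('/path/to/files/*.html.gz'):
            yield Request('file://' + os.path.abspath(path),
                          callback=self.parse_item)

    def parse_item(self, response):
        # The file handler hands back the raw gzipped bytes; decompress first
        html = gzip.GzipFile(fileobj=StringIO(response.body)).read()
        sel = Selector(text=html)
        # e.g.: sel.xpath('//*[@id="page-user"]//h1/span[2]/text()').extract()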
Scrapy cannot deal with the compressed HTML files directly; you have to extract them first. This can be done on the fly in Python, or you can extract them at the operating-system level.
Related: Python Scrapy on offline (local) data
I have a simple task but cannot make my code work. I want to loop over the URLs listed in my text file and download each one using the wget module in Python. Each URL is on a separate line in the text file.
Basically, this is the structure of the list in my text file:
http://e4ftl01.cr.usgs.gov//MODIS_Composites/MOLT/MOD11C3.005/2000.03.01/MOD11C3.A2000061.005.2007177231646.hdf
http://e4ftl01.cr.usgs.gov//MODIS_Composites/MOLT/MOD11C3.005/2014.12.01/MOD11C3.A2014335.005.2015005235231.hdf
There are about 178 URL lines in total. Each downloaded file should be saved in the current working directory.
Below is the initial code that I am working:
import os, fileinput, urllib2 as url, wget

os.chdir("E:/Test/dwnld")
for line in fileinput.FileInput("E:/Test/dwnld/data.txt"):
    print line
    openurl = wget.download(line)
The error message is:
Traceback (most recent call last):
File "E:\Python_scripts\General_purpose\download_url_from_textfile.py", line 5, in <module>
openurl = wget.download(line)
File "C:\Python278\lib\site-packages\wget.py", line 297, in download
(fd, tmpfile) = tempfile.mkstemp(".tmp", prefix=prefix, dir=".")
File "C:\Python278\lib\tempfile.py", line 308, in mkstemp
return _mkstemp_inner(dir, prefix, suffix, flags)
File "C:\Python278\lib\tempfile.py", line 239, in _mkstemp_inner
fd = _os.open(file, flags, 0600)
OSError: [Errno 22] Invalid argument: ".\\MOD11C3.A2000061.005.2007177231646.hdf'\n.frbfrp.tmp"
Try to use urllib.urlretrieve. Check the documentation here: https://docs.python.org/2/library/urllib.html#urllib.urlretrieve
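Note that the traceback's real culprit is again a trailing newline: the temp-file name wget builds contains '\n', which is an invalid character in a Windows filename. A sketch combining urlretrieve with strip(), assuming Python 2 as in the question:

import os
import urllib

os.chdir("E:/Test/dwnld")

with open("E:/Test/dwnld/data.txt") as url_list:
    for line in url_list:
        url = line.strip()             # drop the trailing '\n'
        if not url:
            continue
        filename = url.split("/")[-1]  # last path segment, e.g. the .hdf name
        # urlretrieve downloads the URL straight to the given filename
        urllib.urlretrieve(url, filename)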
I know you can open files, browsers, and URLs in the Python GUI. However, I don't know how to apply this to programs. For example, none of the below work. (The below are snippets from my growing chat bot program):
def browser():
    print('OPENING FIREFOX...')
    handle = webbroswer.get() # webbrowser is imported at the top of the file
    handle.open('http://youtube.com')
    handle.open_new_tab('http://google.com')
and
def file():
    file = str(input('ENTER THE FILE\'S NAME AND EXTENSION:'))
    action = open(file, 'r')
    actionTwo = action.read()
    print(actionTwo)
These errors occur (in the order of the snippets above, each from a separate run):
OPENING FIREFOX...
Traceback (most recent call last):
File "C:/Users/RCOMP/Desktop/Programming/Python Files/AI/COMPUTRON_01.py", line 202, in <module>
askForQuestions()
File "C:/Users/RCOMP/Desktop/Programming/Python Files/AI/COMPUTRON_01.py", line 64, in askForQuestions
browser()
File "C:/Users/RCOMP/Desktop/Programming/Python Files/AI/COMPUTRON_01.py", line 38, in browser
handle = webbroswer.get()
NameError: global name 'webbroswer' is not defined
>>>
ENTER THE FILE'S NAME AND EXTENSION:file.txt
Traceback (most recent call last):
File "C:/Users/RCOMP/Desktop/Programming/Python Files/AI/COMPUTRON_01.py", line 202, in <module>
askForQuestions()
File "C:/Users/RCOMP/Desktop/Programming/Python Files/AI/COMPUTRON_01.py", line 66, in askForQuestions
file()
File "C:/Users/RCOMP/Desktop/Programming/Python Files/AI/COMPUTRON_01.py", line 51, in file
action = open(file, 'r')
IOError: [Errno 2] No such file or directory: 'file.txt'
>>>
Am I handling this wrong, or can I just not use open() and webbrowser in a program?
You should read the errors and try to understand them - they are very helpful in this case - as they often are:
The first one says NameError: global name 'webbroswer' is not defined.
You can see that webbrowser is misspelled in the code. The traceback also tells you the line where it found the error (line 38).
The second one, IOError: [Errno 2] No such file or directory: 'file.txt', tells you that you're trying to open a file that doesn't exist. This fails because you specified
action = open(file, 'r')
which means that you're opening the file for reading. Python does not allow reading from a file that does not exist; only the writing modes ('w', 'a', 'a+', and so on) create a missing file for you.
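A sketch of both functions with these fixes applied, assuming Python 3 (where input() behaves as the question expects); the second function is renamed so the name file isn't reused for both the function and its variable:

import webbrowser  # note the spelling

def browser():
    print('OPENING FIREFOX...')
    handle = webbrowser.get()               # controller for the default browser
    handle.open('http://youtube.com')
    handle.open_new_tab('http://google.com')

def show_file():
    name = input("ENTER THE FILE'S NAME AND EXTENSION: ")
    try:
        with open(name, 'r') as f:          # 'r' requires the file to exist
            print(f.read())
    except OSError as e:                    # IOError is an alias of OSError in Python 3
        print('Could not open %s: %s' % (name, e))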