Unable to write files with special characters in their names using Python

I developed a web crawler to extract the source code of every page linked from a wiki page. The program terminates after writing only a few files.
import urllib2

def fetch_code(link_list):
    for href in link_list:
        response = urllib2.urlopen("https://www.wikipedia.org/" + href)
        content = response.read()
        page = open("%s.html" % href, 'w')
        # note: str.replace treats this pattern as a literal string, not a regex
        page.write(content.replace("[\/:?*<>|]", " "))
        page.close()
link_list is a list containing the links extracted from the seed page.
The error I get after executing is:
IOError: [Errno 2] No such file or directory: u'M/s.html'

You cannot create a file with '/' in its name, because '/' is the path separator.
You could percent-escape the filename instead, so M/s.html becomes M%2Fs.html ('/' encodes as %2F).
In Python 2, you can simply use urllib to escape the filename, for example:
import urllib
filePath = urllib.quote_plus('M/s.html')
print(filePath)
On the other hand, you could also save the HTTP responses into a directory hierarchy, so that M/s.html means a file s.html under a directory named 'M', as sketched below.
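A minimal sketch of that hierarchy approach, assuming the hrefs look like 'M/s' (os.makedirs creates any missing intermediate directories):

import os
import urllib2

def fetch_code(link_list):
    for href in link_list:
        response = urllib2.urlopen("https://www.wikipedia.org/" + href)
        content = response.read()
        path = "%s.html" % href
        directory = os.path.dirname(path)
        if directory and not os.path.isdir(directory):
            os.makedirs(directory)  # creates 'M/' for a link like 'M/s'
        with open(path, 'w') as page:
            page.write(content)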

Related

Python 3: How to save downloaded webpages to a specified dir?

I am trying to save all the <a> links within the Python homepage into a folder named 'Downloaded Pages'. However, after two iterations through the for loop I receive the following error:
www.python.org#content
<_io.BufferedWriter name='Downloaded Pages/www.python.org#content'>
www.python.org#python-network
<_io.BufferedWriter name='Downloaded Pages/www.python.org#python-network'>
Traceback (most recent call last):
  File "/Users/Lucas/Python/AP book exercise/Web Scraping/linkVerification.py", line 26, in <module>
    downloadedPage = open(os.path.join('Downloaded Pages', os.path.basename(linkUrlToOpen)), 'wb')
IsADirectoryError: [Errno 21] Is a directory: 'Downloaded Pages/'
I am unsure why this happens, as it appears the pages are being saved: seeing '<_io.BufferedWriter name='Downloaded Pages/www.python.org#content'>' says to me it is the correct path.
This is my code:
import requests, os, bs4

# Create a new folder to download webpages to
os.makedirs('Downloaded Pages', exist_ok=True)

# Download webpage
url = 'https://www.python.org/'
res = requests.get(url)
res.raise_for_status()  # Check if the download was successful

soupObj = bs4.BeautifulSoup(res.text, 'html.parser')  # Collects all text from the webpage

# Find all 'a' links on the webpage
linkElem = soupObj.select('a')
numOfLinks = len(linkElem)

for i in range(numOfLinks):
    linkUrlToOpen = 'https://www.python.org' + linkElem[i].get('href')
    print(os.path.basename(linkUrlToOpen))

    # save each downloaded page to the 'Downloaded Pages' folder
    downloadedPage = open(os.path.join('Downloaded Pages', os.path.basename(linkUrlToOpen)), 'wb')
    print(downloadedPage)

    if linkElem == []:
        print('Error, link does not work')
    else:
        for chunk in res.iter_content(100000):
            downloadedPage.write(chunk)
        downloadedPage.close()
Appreciate any advice, thanks.
The problem is that when you take the basename of a URL that ends in a page like somepage.html it works, but when the URL doesn't specify a page, like "http://python.org/", the basename is actually empty (you can try printing first the url and then the basename between brackets or something to see what I mean). So to work around that, the easiest solution would be to use absolute paths, as @Thyebri said.
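A quick sketch of that empty-basename behaviour, printing the basename between brackets as suggested (standard library only):

import os

print('[%s]' % os.path.basename('https://www.python.org/'))       # prints [] - empty basename
print('[%s]' % os.path.basename('https://www.python.org/about'))  # prints [about]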
Also, remember that the file name you write cannot contain characters like '/', '\' or '?'.
So, I don't know if the following code is messy or not, but using the re library I would do the following:
import re

filename = re.sub(r'[\/*:"?]+', '-', linkUrlToOpen.split("://")[1])
downloadedPage = open(os.path.join('Downloaded_Pages', filename), 'wb')
So first I remove the "https://" part, and then with the regular expressions library I replace all the usual symbols that are present in URL links with a dash '-', and that is the name that will be given to the file.
Hope it works!

python: wget module downloads file without any extension

I am writing a small Python script to download a file from a follow link and retrieve the original filename and its extension. But I have come across one such follow link for which Python downloads the file without any extension, whereas the file has a .txt extension when downloaded using a browser.
Below is the code I am trying:
from urllib.request import urlopen
from urllib.parse import unquote
import wget

filePath = 'D:\\folder_path'
followLink = 'http://example.com/Reports/Download/c4feb46c-8758-4266-bec6-12358'
response = urlopen(followLink)
if response.code == 200:
    print('Follow Link(response url) :' + response.url)
    print('\n')
    unquote_url = unquote(response.url)
    file_name = wget.detect_filename(response.url).replace('|', '_')
    print('file_name - ' + file_name)
    wget.download(response.url, filePath)
The file_name variable in the above code just gives 'c4feb46c-8758-4266-bec6-12358' as the filename, whereas I want to download it as c4feb46c-8758-4266-bec6-12358.txt.
I have also tried to read the file name from the headers, i.e. response.info(), but I am not getting a proper file name there.
Can anyone please help me with this? I am stuck in my work. Thanks in advance.
Wget gets the filename from the URL itself. For example, if your URL was https://someurl.com/filename.pdf, it is saved as filename.pdf. If it was https://someurl.com/filename, it is saved as filename. Since wget.download returns the filename of the downloaded file, you can rename it to any extension you want with os.rename(filename, filename+'.<extension>').
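A minimal sketch of that rename, assuming .txt is the extension the browser would have used (wget.download returns the path of the file it saved; the follow link is the question's example.com placeholder):

import os
import wget

followLink = 'http://example.com/Reports/Download/c4feb46c-8758-4266-bec6-12358'
saved = wget.download(followLink, 'D:\\folder_path')  # saved without an extension
os.rename(saved, saved + '.txt')  # append the expected extension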

How to write a file name based on the URL name in Python?

I have an API that scans a large file of URLs; I read each URL and get the result in JSON.
The URLs and domains look like:
google.com
http://c.wer.cn/311/369_0.jpg
How do I change the file name format to use the URL name, ".format(url_scan, dates)"?
If I use a manual name it successfully creates a file, but I want to use each URL name read from the URL text file as the file name.
A domain name used as the JSON file name is created successfully without errors:
import csv
import json
import subprocess

dates = yesterday.strftime('%y%m%d')
savefile = Directory + "HTTP_{}_{}.json".format(url_scan, dates)
out = subprocess.check_output("python3 {}/pa.py -K {} "
                              "--sam '{}' > {}"
                              .format(SCRIPT_DIRECTORY, API_KEY_URL, json.dumps(payload), savefile), shell=True).decode('UTF-8')
result_json = json.loads(out)
with open(RES_DIRECTORY + 'HTTP-aut-20{}.csv'.format(dates), 'a') as f:
    writer = csv.writer(f)
    for hits in result_json['hits']:
        writer.writerow([url_scan, hits['_date']])
        print('{},{}'.format(url_scan, hits['_date']))
The error is displayed only when an HTTP URL is used for the JSON file name, so the directory itself is not the problem: every / in the name is interpreted by the system as a directory separator.
[Errno 2] No such file or directory: '/Users/tes/HTTP_http://c.wer.cn/311/369_0.jpg_190709.json'
Most, if not all, operating systems disallow the characters : and / from being used in filenames, as they have special meaning in filesystem paths. So that's why it's giving you an error.
You could replace those characters like this, for example:
filename = 'http://c.wer.cn/311/369_0.jpg.json'
filename = filename.replace(':', '-').replace('/', '_')
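Applied to the question's savefile, a minimal sketch (url_scan, Directory and dates as defined in the question):

safe_name = url_scan.replace(':', '-').replace('/', '_')
savefile = Directory + "HTTP_{}_{}.json".format(safe_name, dates)
# e.g. .../HTTP_http-__c.wer.cn_311_369_0.jpg_190709.json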

How to save PDF files using Scrapy?

I am new to Python and have a problem using Scrapy. I need to download some PDF files from URLs (the URLs point to PDFs, but there is no .pdf in them) and store them in a directory.
So far I have populated my items with title (as you can see I have passed the title as metadata of my previous request) and the body (which I get from the response body of my last request).
When the code reaches the with open call, though, I always get an error back from the terminal like this:
exceptions.IOError: [Errno 2] No such file or directory:
Here is my code:
def parse_objects(self, response):
    ....
    item = Item()
    item['title'] = titles.xpath('text()').extract()
    item['url'] = titles.xpath('a[@class="title"]/@href').extract()
    request = Request(item['url'][0], callback=self.parse_urls)
    request.meta['item'] = item
    yield request

def parse_urls(self, response):
    item = response.meta['item']
    item['desc'] = response.body
    with open(item['title'][1], "w") as f:
        f.write(response.body)
I am using item['title'][1] because the title field is a list, and I need to save the PDF file using the second item, which is the name. As far as I know, when I use with open and there is no such file, Python creates a file automatically.
I'm using Python 3.4.
Can anyone help?
First you have to find the XPath of the URLs that you need to download, and save those links into a list.
Import the Python module named urllib (import urllib).
Use urllib.urlretrieve to download the PDF files.
For example:
import urllib
url = []
url.extend(hxs.select('//a[@class="df"]/@href').extract())
for i in range(len(url)):
    urllib.urlretrieve(url[i], filename='%s' % i)
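Since the question mentions Python 3.4, where urlretrieve lives in urllib.request, a hedged equivalent sketch (the example.com links are placeholders for the extracted hrefs):

from urllib.request import urlretrieve

urls = ['http://example.com/doc-one', 'http://example.com/doc-two']  # hypothetical PDF links
for i, link in enumerate(urls):
    urlretrieve(link, filename='%d.pdf' % i)  # save each PDF under a numeric name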

urllib2 File Download Fails...due to System Security?

I'm trying to download files (approximately 1 - 1.5MB/file) from a NASA server (URL), but to no avail! I've tried a few things with urllib2 and run into two results:
I create a new file on my machine that is only ~200KB and has nothing in it
I create a 1.5MB file on my machine that has nothing in it!
By "nothing in it" I mean when I open the file (these are hdf5 files, so I open them in hdfView) I see no hierarchical structure...literally looks like an empty h5 file. But, when I open the file in a text editor I can see there is SOMETHING there (it's binary, so in text it looks like...well, binary).
I think I am using urllib2 appropriately, though I have never successfully used urllib2 before. Would you please comment on whether what I am doing is right or not, and suggest something better?
from urllib2 import Request, urlopen, URLError, HTTPError

base_url = 'http://avdc.gsfc.nasa.gov/index.php?site=1480884223&id=40&go=list&path=%2FH2O%2F/2010'
file_name = 'download_2.php?site=1480884223&id=40&go=download&path=%2FH2O%2F2010&file=MLS-Aura_L2GP-H2O_v03-31-c01_2010d360.he5'
url = base_url + file_name
req = Request(url)

# Open the url
try:
    f = urlopen(req)
    print "downloading " + url

    # Open our local file for writing; file_mode is defined elsewhere ('b' for binary data)
    local_file = open('test.he5', "w" + file_mode)
    # Write to our local file
    local_file.write(f.read())
    local_file.close()
except HTTPError, e:
    print "HTTP Error:", e.code, url
except URLError, e:
    print "URL Error:", e.reason, url
I got this script (which seems to be the closest to working) from here.
I am unsure what the file_name should be. I looked at the page source information of the archive and pulled the file name listed there (not the same as what shows up on the web page), and doing this yields the 1.5MB file that shows nothing in hdfview.
You are creating an invalid url:
base_url = 'http://avdc.gsfc.nasa.gov/index.php?site=1480884223&id=40&go=list&path=%2FH2O%2F/2010'
file_name = 'download_2.php?site=1480884223&id=40&go=download&path=%2FH2O%2F2010&file=MLS-Aura_L2GP-H2O_v03-31-c01_2010d360.he5'
url = base_url + file_name
You probably meant:
base_url = 'http://avdc.gsfc.nasa.gov/'
file_name = 'download_2.php?site=1480884223&id=40&go=download&path=%2FH2O%2F2010&file=MLS-Aura_L2GP-H2O_v03-31-c01_2010d360.he5'
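Alternatively, a small sketch using urlparse.urljoin (Python 2, matching the question's urllib2 code) to combine the two pieces:

from urlparse import urljoin

base_url = 'http://avdc.gsfc.nasa.gov/'
file_name = 'download_2.php?site=1480884223&id=40&go=download&path=%2FH2O%2F2010&file=MLS-Aura_L2GP-H2O_v03-31-c01_2010d360.he5'
url = urljoin(base_url, file_name)  # same result as concatenation here, but safe if base_url lacks the trailing slash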
When downloading a large file, it's better to use a buffered copy from filehandle to filehandle:
import shutil
# ...
f = urlopen(req)
with open('test.he5', "w" + file_mode) as local_file:
    shutil.copyfileobj(f, local_file)
shutil.copyfileobj will efficiently read from the open urllib connection and write to the open local_file handle. Note the with statement: when the code block underneath it concludes, it automatically closes the file for you.
