I have recently been trying the wget package instead of requests. It is pretty straightforward and easy to use, but I have been having a problem with one specific link.
When I try to download a PNG image from Wikipedia, wget.download keeps raising an IndexError when it tries to write the file. This is the exact error:
wget.download(url, f"C:/Users/Family/Pictures/downloads/{name}")
File "C:\Users\Family\AppData\Local\Programs\Python\Python38\lib\site-packages\wget.py", line 527, in download
filename = detect_filename(url, out, headers)
File "C:\Users\Family\AppData\Local\Programs\Python\Python38\lib\site-packages\wget.py", line 486, in detect_filename
names["headers"] = filename_from_headers(headers) or ''
File "C:\Users\Family\AppData\Local\Programs\Python\Python38\lib\site-packages\wget.py", line 258, in filename_from_headers
name = fnames[0].split('=')[1].strip(' \t"')
IndexError: list index out of range
I tried setting a specific filename but it still didn't work. When I use wget from the command line it has no problem with the URL, so how can I fix this?
import wget
# This is the link to the image
url = "https://upload.wikimedia.org/wikipedia/commons/thumb/b/b6/Image_created_with_a_mobile_phone.png/1200px-Image_created_with_a_mobile_phone.png"
# It does not seem to have a problem detecting the filename
name = wget.detect_filename(url)
# I tried to set a specific filename but I still got the error with or without it
wget.download(url, f"C:/Users/Family/Pictures/downloads/{name}")
I ran into the same case.
After doing some research on the response headers, I found that in my case the error is thrown whenever the response does not contain a header like 'content-disposition': 'attachment; filename="xxx"'. Hope this helps.
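If it helps, a rough workaround is to sidestep wget's filename detection entirely and stream the file with requests, building the name from the URL instead of from the headers. This is only a sketch, not tested against your exact setup; the URL and target folder are the ones from the question:

import os
import requests

# URL and target folder taken from the question
url = ("https://upload.wikimedia.org/wikipedia/commons/thumb/b/b6/"
       "Image_created_with_a_mobile_phone.png/1200px-Image_created_with_a_mobile_phone.png")
out_dir = "C:/Users/Family/Pictures/downloads"

# Build the filename from the last path segment of the URL,
# so nothing depends on the Content-Disposition header.
name = os.path.basename(url)
path = os.path.join(out_dir, name)

resp = requests.get(url, stream=True)
resp.raise_for_status()

with open(path, "wb") as f:
    for chunk in resp.iter_content(chunk_size=8192):
        f.write(chunk)

print(path)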
I am trying to save all the <a> links on the Python homepage into a folder named 'Downloaded Pages'. However, after 2 iterations through the for loop I receive the following error:
www.python.org#content <_io.BufferedWriter name='Downloaded Pages/www.python.org#content'>
www.python.org#python-network <_io.BufferedWriter name='Downloaded Pages/www.python.org#python-network'>
Traceback (most recent call last):
  File "/Users/Lucas/Python/AP book exercise/Web Scraping/linkVerification.py", line 26, in
    downloadedPage = open(os.path.join('Downloaded Pages', os.path.basename(linkUrlToOpen)), 'wb')
IsADirectoryError: [Errno 21] Is a directory: 'Downloaded Pages/'
I am unsure why this happens, because the output '<_io.BufferedWriter name='Downloaded Pages/www.python.org#content'>' suggests to me that the pages are being saved to the correct path.
This is my code:
import requests, os, bs4

# Create a new folder to download webpages to
os.makedirs('Downloaded Pages', exist_ok=True)

# Download webpage
url = 'https://www.python.org/'
res = requests.get(url)
res.raise_for_status()  # Check if the download was successful

soupObj = bs4.BeautifulSoup(res.text, 'html.parser')  # Collects all text from the webpage

# Find all 'a' links on the webpage
linkElem = soupObj.select('a')
numOfLinks = len(linkElem)

for i in range(numOfLinks):
    linkUrlToOpen = 'https://www.python.org' + linkElem[i].get('href')
    print(os.path.basename(linkUrlToOpen))

    # save each downloaded page to the 'Downloaded Pages' folder
    downloadedPage = open(os.path.join('Downloaded Pages', os.path.basename(linkUrlToOpen)), 'wb')
    print(downloadedPage)

    if linkElem == []:
        print('Error, link does not work')
    else:
        for chunk in res.iter_content(100000):
            downloadedPage.write(chunk)
        downloadedPage.close()
Appreciate any advice, thanks.
The problem is that parsing the basename works when the URL ends in something like an .html file, but when the URL does not specify one, like "http://python.org/", the basename is actually empty (try printing the URL and then the basename between brackets or something to see what I mean). So to work around that, the easiest solution would be to use absolute paths, as @Thyebri said.
Also remember that the filename you write cannot contain characters like '/', '\' or '?'.
I don't know whether the following code is messy or not, but using the re library I would do the following:
filename = re.sub('[\/*:"?]+', '-', linkUrlToOpen.split("://")[1])
downloadedPage = open(os.path.join('Downloaded_Pages', filename), 'wb')
So first I remove the "https://" part, and then with the regular expressions library I replace all the usual symbols that appear in URL links with a dash '-', and that is the name that will be given to the file.
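Putting it together, a rough sketch of the whole loop could look like this (untested; it fetches each linked page with its own request, which I am assuming is the intent, and it skips anchors and external links):

import os
import re

import bs4
import requests

os.makedirs('Downloaded_Pages', exist_ok=True)

res = requests.get('https://www.python.org/')
res.raise_for_status()
soupObj = bs4.BeautifulSoup(res.text, 'html.parser')

for linkElem in soupObj.select('a'):
    href = linkElem.get('href')
    if not href or not href.startswith('/'):
        continue  # skip empty hrefs, anchors and external links in this sketch
    linkUrlToOpen = 'https://www.python.org' + href

    # Strip the scheme, then replace characters that are awkward or
    # forbidden in filenames with a dash, as described above.
    filename = re.sub(r'[\/*:"?]+', '-', linkUrlToOpen.split("://")[1])

    # Assumption: each linked page is downloaded with its own request.
    linkRes = requests.get(linkUrlToOpen)
    if not linkRes.ok:
        print('Error, link does not work:', linkUrlToOpen)
        continue

    with open(os.path.join('Downloaded_Pages', filename), 'wb') as downloadedPage:
        for chunk in linkRes.iter_content(100000):
            downloadedPage.write(chunk)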
Hope it works!
Here is an example URL: "https://procurement-notices.undp.org/view_file.cfm?doc_id=257280"
If you open it in the browser, a file starts downloading to your system.
I want to download this file using Python and store it somewhere on my computer.
This is how I tried:
import requests
# first_url = 'https://readthedocs.org/projects/python-guide/downloads/pdf/latest/'
second_url="https://procurement-notices.undp.org/view_file.cfm?doc_id=257280"
myfile = requests.get(second_url , allow_redirects=True)
# this works for the first URL
# open('example.pdf' , 'wb').write(myfile.content)
# this didn't work for either of them
# open('example.txt' , 'wb').write(myfile.content)
# this works for the second URL
open('example.doc' , 'wb').write(myfile.content)
First: putting first_url in the browser downloads a PDF file, and putting second_url downloads a .doc file. How can I know what type of file the URL will give us, i.e. what type of file will be downloaded, so that I use the correct open(...) call?
Second: if I use the second URL in the browser, a file named "T__proc_notices_notices_080_k_notice_doc_79545_770020123.docx" starts downloading. How can I know this file name when I try to download the file?
If you know of any better solution, kindly let me know.
Kindly have a quick look at the "Downloading Files from URLs and zip downloaded files in python" question as well.
myfile.headers['content-type'] will give you the MIME type of the URL's content, and myfile.headers['content-disposition'] gives you info such as the filename (if the response contains this header at all).
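For example, something along these lines should work when the server actually sends a Content-Disposition header with a filename; the regex parsing here is just one simple way to pull the name out, not the only one:

import re
import requests

url = "https://procurement-notices.undp.org/view_file.cfm?doc_id=257280"
myfile = requests.get(url, allow_redirects=True)

print(myfile.headers.get('content-type'))  # the MIME type of the content
disposition = myfile.headers.get('content-disposition', '')

# Pull the filename out of a header like: attachment; filename="xxx.docx"
match = re.search(r'filename="?([^";]+)"?', disposition)
filename = match.group(1) if match else 'downloaded_file'  # fallback if the header is missing

with open(filename, 'wb') as f:
    f.write(myfile.content)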
You can use the Content-Type response header: for the first URL it is application/pdf and for the second URL it is application/msword, and you save the file according to it. You can make an extension dictionary that stores the possible content types with their file extensions and match against it. Your second question is basically the same as this one, so I am taking your two URLs from that question, and for the file names I am just using integers.
import requests

all_Urls = ['https://omextemplates.content.office.net/support/templates/en-us/tf16402488.dotx',
            'https://procurement-notices.undp.org/view_file.cfm?doc_id=257280']

extension_dict = {'application/vnd.openxmlformats-officedocument.wordprocessingml.document': '.docx',
                  'application/vnd.openxmlformats-officedocument.wordprocessingml.template': '.dotx',
                  'application/vnd.ms-word.document.macroEnabled.12': '.docm',
                  'application/vnd.ms-word.template.macroEnabled.12': '.dotm',
                  'application/pdf': '.pdf',
                  'application/msword': '.doc'}

for i, url in enumerate(all_Urls):
    resp = requests.get(url)
    response_headers = resp.headers
    # Look up the extension (including the leading dot) from the Content-Type header
    file_extension = extension_dict[response_headers['Content-Type']]
    with open(f"{i}{file_extension}", 'wb') as f:
        f.write(resp.content)
For MIME types, see this answer.
I am writing a small piece of Python code to download a file from a follow link and retrieve the original filename and its extension. But I have come across one follow link for which Python downloads the file without any extension, whereas the file has a .txt extension when downloaded using the browser.
Below is the code I am trying:
from urllib.request import urlopen
from urllib.parse import unquote
import wget

filePath = 'D:\\folder_path'
followLink = 'http://example.com/Reports/Download/c4feb46c-8758-4266-bec6-12358'
response = urlopen(followLink)

if response.code == 200:
    print('Follow Link(response url) :' + response.url)
    print('\n')
    unquote_url = unquote(response.url)
    file_name = wget.detect_filename(response.url).replace('|', '_')
    print('file_name - ' + file_name)
    wget.download(response.url, filePath)
The file_name variable in the above code just gives 'c4feb46c-8758-4266-bec6-12358' as the filename, whereas I want to download it as c4feb46c-8758-4266-bec6-12358.txt.
I have also tried to read the file name from the headers, i.e. response.info(), but I am not getting a proper file name.
Can anyone please help me with this? I am stuck in my work. Thanks in advance.
Wget gets the filename from the URL itself. For example, if your URL was https://someurl.com/filename.pdf, it is saved as filename.pdf. If it was https://someurl.com/filename, it is saved as filename. Since wget.download returns the filename of the downloaded file, you can rename it to any extension you want with os.rename(filename, filename+'.<extension>').
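A minimal sketch of that renaming step, reusing the paths from the question and hard-coding the .txt extension you expect (assumption: the extension is known up front; detecting it automatically would mean inspecting the response headers, which is not covered here):

import os
import wget

# Follow link and target folder taken from the question
followLink = 'http://example.com/Reports/Download/c4feb46c-8758-4266-bec6-12358'
filePath = 'D:\\folder_path'

# wget.download returns the path of the file it actually wrote
downloaded = wget.download(followLink, filePath)

# Add the extension you expect; .txt in this case
os.rename(downloaded, downloaded + '.txt')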
attach_file is not picking up the absolute URL even though the file exists. It is able to pick up an internal path and send the file, but not the absolute URL.
email.attach_file("http://devuserapi.doctorinsta.com/static/pdfs/Imran_1066.pdf",mimetype="application/pdf")
This file opens when I copy-paste the URL into the browser. What could be the issue?
Thanks in advance
attach_file takes a file from your filesystem, not a URL, so you have to use a local path to it.
See https://docs.djangoproject.com/en/1.9/topics/email/
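If you want to keep attach_file, one rough, untested option is to download the PDF to a local path first and attach that path; the temporary path below is just an example, and email is assumed to be your EmailMessage instance from the question:

import urllib2

pdf_url = "http://devuserapi.doctorinsta.com/static/pdfs/Imran_1066.pdf"
local_path = "/tmp/Imran_1066.pdf"  # example location; any writable path works

# Save a local copy of the file
with open(local_path, "wb") as f:
    f.write(urllib2.urlopen(pdf_url).read())

# attach_file works now because local_path points to a real file on disk
email.attach_file(local_path, mimetype="application/pdf")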
One, untested, possibility is to use the attach method instead and to download the file on the fly:
import urllib2

response = urllib2.urlopen("http://devuserapi.doctorinsta.com/static/pdfs/Imran_1066.pdf")
email.attach('Imran_1066.pdf', response.read(), mimetype="application/pdf")
It lacks error checking to make sure the file was downloaded, of course, and I haven't actually tried it myself, but that might be an alternative for you.
I'm still pretty new to scripting. I'm trying to figure out a way to output a list of URLs after any redirects have occurred. I have about 800 sites in a text file that I want to test for redirects using a Python script, writing each final URL to a file (on its own line). Is this possible?
With the file open, I can't figure out how to make urllib2.urlopen() read a line in a text file. It seems to require a URL? Maybe there is another module or something else I should be using instead?
Please help.
Thanks!
I'd use the requests library:
import requests

with open('urls.txt') as url_file:
    for url in url_file:
        resp = requests.get(url.strip())
        print(resp.url)
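If you want the final URLs written to a file, one per line, as described in the question, the same idea extends naturally (the output filename here is just an example):

import requests

with open('urls.txt') as url_file, open('final_urls.txt', 'w') as out_file:
    for url in url_file:
        url = url.strip()
        if not url:
            continue  # skip blank lines
        resp = requests.get(url)
        out_file.write(resp.url + '\n')  # resp.url is the URL after any redirects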