Request on URL having Mandarin characters - Python

I'm trying to fetch the content at the URL below, but it doesn't work with the requests module, although the link opens fine in a browser. How can I fetch this link using the requests library?
In [2]: requests.get('http://www.dwconstir.com/inc/download.asp?FileName=3Q17%20%uC2E4%uC801PT_ENG.pdf')
Out[2]: <Response [400]>

You are trying to download a PDF. You can do that using urllib2.
Sample:
import urllib2
src_url = "http://www.dwconstir.com/inc/download.asp?FileName=3Q17%20%uC2E4%uC801PT_ENG.pdf"
path = "DEST_PATH" #Folder location where you want to download the file.
response = urllib2.urlopen(src_url)
file = open(path + "document.pdf", 'wb')
file.write(response.read())
file.close()
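Note that urllib2 only exists on Python 2; on Python 3 the same functionality lives in urllib.request. A minimal Python 3 sketch of the same download (same URL, and DEST_PATH is still a placeholder for your folder):
import urllib.request
src_url = "http://www.dwconstir.com/inc/download.asp?FileName=3Q17%20%uC2E4%uC801PT_ENG.pdf"
path = "DEST_PATH"  #Folder location where you want to download the file.
# urlopen returns a file-like response; read() gives the raw PDF bytes
with urllib.request.urlopen(src_url) as response, open(path + "document.pdf", 'wb') as f:
    f.write(response.read())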
Using requests
import requests
url = 'http://www.dwconstir.com/inc/download.asp?FileName=3Q17%20%uC2E4%uC801PT_ENG.pdf'
path = "DEST_PATH" #Folder location where you want to download the file.
r = requests.get(url, stream=True)
with open(path + "document.pdf", 'wb') as f:
    f.write(r.content)
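Because stream=True is passed, you can also write the body in chunks instead of buffering it all in r.content. A small sketch, reusing the url and path variables from above:
r = requests.get(url, stream=True)
with open(path + "document.pdf", 'wb') as f:
    # iter_content yields the response body piece by piece, which keeps memory use low for big files
    for chunk in r.iter_content(chunk_size=8192):
        f.write(chunk)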

Related

Downloading pdf from URL with urllib (getting weird html output instead)

I am trying to download PDFs from several PDF URLs.
An example: https://www.fasb.org/page/showpdf?path=0001-%201700-UFI%20AICPA%20ACSEC%20Hanson.pdf
This URL opens the PDF directly in my browser.
However, when I use the code below to download it from the link, I get an HTML file instead.
link = "https://www.fasb.org/page/showpdf?path=0001-%201700-UFI%20AICPA%20ACSEC%20Hanson.pdf"
urllib.request.urlretrieve(link, f"/content/drive/MyDrive/Research/pdfs/1.pdf")
The resulting "pdf" file is actually an HTML file.
How do I solve this issue? Appreciate any help, thanks!
You can use BeautifulSoup or lxml to find the <iframe> and get its src, and then use that to download the file.
import urllib.request
import urllib.parse
from bs4 import BeautifulSoup as BS
url = 'https://www.fasb.org/page/showpdf?path=0001-%201700-UFI%20AICPA%20ACSEC%20Hanson.pdf'
response = urllib.request.urlopen(url)
soup = BS(response.read(), 'html.parser')
iframe = soup.find('iframe')
url = iframe['src']
filename = urllib.parse.unquote(url)
filename = filename.rsplit('/', 1)[-1]
urllib.request.urlretrieve(url, filename)
Alternatively, you can check a few files to see if they all use the same https://d2x0djib3vzbzj.cloudfront.net/ prefix and simply replace it in the URL.
import urllib.request
import urllib.parse
url = 'https://www.fasb.org/page/showpdf?path=0001-%201700-UFI%20AICPA%20ACSEC%20Hanson.pdf'
url = url.replace('https://www.fasb.org/page/showpdf?path=',
                  'https://d2x0djib3vzbzj.cloudfront.net/')
filename = urllib.parse.unquote(url)
filename = filename.rsplit('/', 1)[-1]
urllib.request.urlretrieve(url, filename)
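If you want to be sure you really received a PDF and not an HTML error page, one option is to check the Content-Type header before saving. A small sketch, reusing the url and filename variables from the snippet above:
import urllib.request
response = urllib.request.urlopen(url)
content_type = response.headers.get('Content-Type', '')
if 'pdf' in content_type.lower():
    with open(filename, 'wb') as f:
        f.write(response.read())
else:
    print('Got', content_type, 'instead of a PDF')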

Download a zip file from a DL link to a specific folder in Python

I have this download link that I've extracted from a site with the following code:
import urllib
from urllib.request import urlopen, Request
from bs4 import BeautifulSoup
import re
url = r'https://rawkuma.com/manga/kimetsu-no-yaiba/'
req = Request(url, headers={'User-Agent':'Chrome'})
html = urlopen(req)
soup = BeautifulSoup(html, "html.parser")
body = soup.find_all('body')
content = body[0].find_all('div',{'id':'content'})
ul = content[0].find_all('ul')
chapter = ul[0].find_all('li',{'data-num':'128'})
dt = chapter[0].find_all('div',{'class':'dt'})
a = dt[0].find_all('a')
a = str(a[0])
href = a.split(" ")[2][5:]
Now I want to use that href, which is a download link to a zip file, and download it to a specified folder. I've tried something like this:
save_path = r'C:\Users\...'
file_name = r'kimetsu-no-yaiba-chapter-128'
completeName = os.path.join(save_path, file_name+".zip")
file1 = open(completeName, "w")
file1.write(href)
file1.close()
But this seems to just add an empty zip file to the folder. And if I try to open the URL first before passing it to write(), it gives me an error:
req = Request(href)
r = urlopen(req)
save_path = r'C:\Users\...'
file_name = r'kimetsu-no-yaiba-chapter-128'
completeName = os.path.join(save_path, file_name+".zip")
file1 = open(completeName, "w")
file1.write(r)
file1.close()
But I get this error:
urllib.error.URLError: <urlopen error unknown url type: "https>
The URL http://dl.rawkuma.com/?id=86046 is not the actual URI of the zip file; there is a redirect to the real link. So here is code to download the zip file based on your example. You need to install the requests library, which makes this easier.
import requests
import os
URL = 'http://dl.rawkuma.com/?id=86046'
res = requests.get(URL, allow_redirects=True)
# this is the actual url for the zip file
print(res.url)
with requests.get(res.url, stream=True) as r:
    r.raise_for_status()
    print('downloading')
    with open(os.path.join('.', 'file.zip'), 'wb') as f:
        for chunk in r.iter_content(chunk_size=8192):
            f.write(chunk)
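As a side note, the original error (unknown url type: "https) most likely comes from building href by slicing the string form of the tag, which leaves a leading quote in the URL. Reading the attribute directly from the BeautifulSoup tag avoids that; a small sketch using the dt variable from the question:
# instead of: a = str(a[0]); href = a.split(" ")[2][5:]
href = dt[0].find_all('a')[0]['href']  # the href attribute, without surrounding quotes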

My code wrongly downloads a CSV file from a URL with Python

I created some code to download a CSV file from a URL. The code downloads the HTML of the page instead; when I paste the URL that I built into a browser it works, but it does not work in the code.
I tried os, response, and urllib, but all these options provided the same result.
This is the link that I ultimately want to download as CSV:
https://www.ishares.com/uk/individual/en/products/251567/ishares-asia-pacific-dividend-ucits-etf/1506575576011.ajax?fileType=csv&fileName=IAPD_holdings&dataType=fund
import requests
#this is the url where the csv is
url='https://www.ishares.com/uk/individual/en/products/251567/ishares-asia-pacific-dividend-ucits-etf?switchLocale=y&siteEntryPassthrough=true'
r = requests.get(url, allow_redirects=True)
response = requests.get(url)
if response.status_code == 200:
    print("Success")
else:
    print("Failure")
#find the url for the CSV
from bs4 import BeautifulSoup
soup = BeautifulSoup(response.content,'lxml')
for i in soup.find_all('a',{'class':"icon-xls-export"}):
    print(i.get('href'))
# I get two types of files, one CSV and the other xls.
link_list=[]
for i in soup.find_all('a', {'class':"icon-xls-export"}):
    link_list.append(i.get('href'))
# I create the link with the CSV
url_csv = "https://www.ishares.com//"+link_list[0]
response_csv = requests.get(url_csv)
if response_csv.status_code == 200:
    print("Success")
else:
    print("Failure")
#Here I want to download the file
import urllib.request
with urllib.request.urlopen(url_csv) as holdings1, open('dataset.csv', 'w') as f:
    f.write(holdings1.read().decode())
I would like to get the CSV data downloaded.
It needs cookies to work correctly.
I use requests.Session() to get and keep cookies automatically.
And I write response_csv.content to the file because I already have it after the second request, so I don't have to make another one. Using urllib.request for that step would create a request without the cookies, and it might not work.
import requests
from bs4 import BeautifulSoup
s = requests.Session()
url='https://www.ishares.com/uk/individual/en/products/251567/ishares-asia-pacific-dividend-ucits-etf?switchLocale=y&siteEntryPassthrough=true'
response = s.get(url, allow_redirects=True)
if response.status_code == 200:
    print("Success")
else:
    print("Failure")
#find the url for the CSV
soup = BeautifulSoup(response.content,'lxml')
for i in soup.find_all('a',{'class':"icon-xls-export"}):
    print(i.get('href'))
# I get two types of files, one CSV and the other xls.
link_list=[]
for i in soup.find_all('a', {'class':"icon-xls-export"}):
    link_list.append(i.get('href'))
# I create the link with the CSV
url_csv = "https://www.ishares.com//"+link_list[0]
response_csv = s.get(url_csv)
if response_csv.status_code == 200:
    print("Success")
    f = open('dataset.csv', 'wb')
    f.write(response_csv.content)
    f.close()
else:
    print("Failure")
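If you want to confirm which cookies the Session collected along the way, a quick check (standard requests API, reusing s from above):
# the session keeps every cookie it has received and sends them with later requests
print(s.cookies.get_dict())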

Download a PDF embedded in a webpage using Python 2.7

I want to download the PDF and store it in a folder on my local computer.
Following is the link of the PDF I want to download: https://ascopubs.org/doi/pdfdirect/10.1200/JCO.2018.77.8738
I have written code using both Python Selenium and urllib, but both fail to download it.
import time, urllib
time.sleep(2)
pdfPath = "https://ascopubs.org/doi/pdfdirect/10.1200/JCO.2018.77.8738"
pdfName = "jco.2018.77.8738.pdf"
f = open(pdfName, 'wb')
f.write(urllib.urlopen(pdfPath).read())
f.close()
It's much easier with requests
import requests
url = 'https://ascopubs.org/doi/pdfdirect/10.1200/JCO.2018.77.8738'
pdfName = "./jco.2018.77.8738.pdf"
r = requests.get(url)
with open(pdfName, 'wb') as f:
    f.write(r.content)
An alternative, using pathlib:
from pathlib import Path
import requests
filename = Path("jco.2018.77.8738.pdf")
url = "https://ascopubs.org/doi/pdfdirect/10.1200/JCO.2018.77.8738"
response = requests.get(url)
filename.write_bytes(response.content)
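In either variant you may want to fail loudly when the server does not return the PDF. A small sketch adding one line to the pathlib version above:
response = requests.get(url)
response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
filename.write_bytes(response.content)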

How can I download all types of files in Python with the requests library

I am building a crawler in Python and I have the list of hrefs from the page.
Now I have the list of file extensions to download, like
list = ['zip','rar','pdf','mp3']
How can I save the files from those URLs to a local directory using Python?
EDIT:
import urllib2
from bs4 import BeautifulSoup
url = "http://www.example.com/downlaod"
site = urllib2.urlopen(url)
html = site.read()
soup = BeautifulSoup(html)
list_urls = soup.find_all('a')
print list_urls[6]
Going by your posted example:
import urllib2
from bs4 import BeautifulSoup
url = "http://www.example.com/downlaod"
site = urllib2.urlopen(url)
html = site.read()
soup = BeautifulSoup(html)
list_urls = soup.find_all('a')
print list_urls[6]
So, the URL you want to fetch next is presumably list_urls[6]['href'].
The first trick is that this might be a relative URL rather than absolute. So:
newurl = list_urls[6]['href']
absurl = urlparse.urljoin(site.url, newurl)  # needs "import urlparse" (urllib.parse in Python 3)
Also, you want to only fetch the file if it has the right extension, so:
if not absurl.endswith(tuple(extensions)):  # str.endswith needs a tuple of suffixes, not a list
    return # or break or whatever
But once you've decided what URL you want to download, it's no harder than your initial fetch:
page = urllib2.urlopen(absurl)
html = page.read()
path = urlparse.urlparse(absurl).path
name = os.path.basename(path)  # needs "import os"
with open(name, 'wb') as f:
    f.write(html)
That's mostly it.
There are a few things you might want to add, but if so, you have to add them all manually. For example:
Look for a Content-Disposition header with a suggested filename to use in place of the URL's basename (see the sketch after this list).
Copy the data from page to f (e.g. with shutil.copyfileobj) instead of reading the whole thing into memory and then writing it out.
Deal with existing files with the same name.
…
But that's the basics.
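For example, here is a rough sketch of the first two items, staying with urllib2 and the absurl variable from above (the Content-Disposition parsing is deliberately naive and only meant as an illustration):
import os
import shutil
import urllib2
import urlparse
page = urllib2.urlopen(absurl)
# prefer the filename suggested by the server, if any
disposition = page.info().getheader('Content-Disposition') or ''
if 'filename=' in disposition:
    name = disposition.split('filename=')[-1].strip('"; ')
else:
    name = os.path.basename(urlparse.urlparse(absurl).path)
# stream the response straight to disk instead of buffering it all in memory
with open(name, 'wb') as f:
    shutil.copyfileobj(page, f)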
You can use the Python requests library, as you asked in the question: http://www.python-requests.org
You can save a file from a URL like this:
import requests
url='http://i.stack.imgur.com/0LJdh.jpg'
data=requests.get(url).content
filename="image.jpg"
with open(filename, 'wb') as f:
    f.write(data)
A solution using urllib3:
import os
import urllib3
from bs4 import BeautifulSoup
import urllib.parse
url = "https://path/site"
site = urllib3.PoolManager()
html = site.request('GET', url)
soup = BeautifulSoup(html.data, "lxml")
list_urls = soup.find_all('a')
And then a recursive function to get all the files:
def recursive_function(list_urls):
    # stop when there are no more links to process
    if not list_urls:
        return
    newurl = list_urls[0]['href']
    absurl = url + newurl
    list_urls.pop(0)
    if absurl.endswith(tuple(extensions)):  # verify it has one of the targeted extensions
        html = site.request('GET', absurl)  # reuse the PoolManager created above
        name = os.path.basename(absurl)
        with open(name, 'wb') as f:
            f.write(html.data)
    return recursive_function(list_urls)
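A possible way to call it, assuming an extensions tuple like the list from the question (the exact values here are only illustrative):
extensions = ('.zip', '.rar', '.pdf', '.mp3')  # hypothetical tuple of wanted suffixes
recursive_function(list_urls)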
