Could someone please convert the following from Python 2 to Python 3?
import requests
url = "http://duckduckgo.com/html"
payload = {'q':'python'}
r = requests.post(url, payload)
with open("requests_results.html", "w") as f:
    f.write(r.content)
and I get:
Traceback (most recent call last):
  File "C:\temp\Python\testFile.py", line 1, in <module>
    import requests
ImportError: No module named 'requests'
I have also tried:
import urllib.request
url = "http://duckduckgo.com/html"
payload = {'q':'python'}
r = urllib.request.post(url, payload)
with open("requests_results.html", "w") as f:
    f.write(r.content)
but I get:
Traceback (most recent call last):
  File "C:\temp\Python\testFile.py", line 5, in <module>
    r = urllib.request.post(url, payload)
AttributeError: 'module' object has no attribute 'post'
In Python 3.2, r.content is a bytes object, not a str, and write on a file opened in text mode does not accept it. You might want to use r.text instead:
with open("requests_results.html", "w") as f:
    f.write(r.text)
You can see it in the requests documentation at http://docs.python-requests.org/en/latest/api.html#main-interface:
class requests.Response
    content - Content of the response, in bytes.
    text - Content of the response, in unicode. If Response.encoding is None and the chardet module is available, the encoding will be guessed.
Edit:
I posted before seeing the edited question. Yes, as Martijn Pieters said, you need to install the requests module for Python 3 in order to be able to import it.
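Putting the two fixes together, the corrected Python 3 version of the original script would be (writing r.text to a text-mode file; opening the file in binary mode and writing r.content would also work):
import requests

url = "http://duckduckgo.com/html"
payload = {'q': 'python'}
r = requests.post(url, data=payload)
with open("requests_results.html", "w") as f:
    f.write(r.text)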
I think the problem here is that the Requests package is not installed. Or, if you have installed it, it was installed into your Python 2.x directory and not for Python 3, which is why you're not able to use the requests module. Try making Python 3 your default copy and then install requests.
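For example, assuming pip is available for your Python 3 interpreter (the exact command varies by platform), something like this installs requests for Python 3 specifically:
# install requests for the Python 3 interpreter rather than Python 2
python3 -m pip install requests
# on Windows, the py launcher achieves the same
py -3 -m pip install requests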
Also try visiting this article by Michael Foord, which walks you through using all the features of urllib2:
import urllib.request
import urllib.parse

url = "https://duckduckgo.com/html"
values = {'q': 'python'}
# urlencode the form fields, then encode to bytes as required for POST data
data = urllib.parse.urlencode(values).encode("utf-8")
# supplying data makes this a POST request
req = urllib.request.Request(url, data)
response = urllib.request.urlopen(req)
the_page = response.read()
print(the_page)
This is the code that I have been trying:
import ijson
import urllib.request
with urllib.request.urlopen(some_link) as read_file:
    path_array = ijson.items(read_file, object_in_json)
but I get this error:
(b'lexical error: invalid char in json text.\n \x1f\x8b\x08 (right here) ------^\n',)
Links are probably not supported by that library.
I advise you to use the requests module, so install it by running pip install requests.
Then, in your .py file:
import requests
response = requests.get(url)
json = response.json()
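If you still want to stream the document through ijson rather than loading it all at once, here is a minimal sketch (my addition, assuming a top-level JSON array, hence the "item" prefix; the URL is hypothetical):
import ijson
import requests

url = "https://example.com/data.json"  # hypothetical URL
r = requests.get(url, stream=True)
r.raw.decode_content = True  # have urllib3 gunzip the stream before ijson sees it
for obj in ijson.items(r.raw, "item"):  # "item" matches each element of a top-level array
    print(obj)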
I'm getting this response when I open this url:
from urllib.request import Request, urlopen

r = Request(r'http://airdates.tv/')
h = urlopen(r).readline()
print(h)
Response:
b'\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x00\xed\xbdkv\xdbH\x96.\xfa\xbbj\x14Q\xaeuJ\xce\xee4E\x82\xa4(9m\xe7\xd2\xd3VZ\xaf2e\xab2k\xf5\xc2\n'
What encoding is this?
Is there a way to decode it based on the standard library?
Thank you in advance for any insight on this matter!
PS: It seems to be gzip.
It's gzip compressed HTML, as you suspected.
Rather than use urllib, use requests, which will decompress the response for you:
import requests
r = requests.get('http://airdates.tv/')
print(r.text)
You can install it with pip install requests, and never look back.
If you really must restrict yourself to the standard library, then decompress it with the gzip module:
import gzip
import urllib2
from cStringIO import StringIO

f = urllib2.urlopen('http://airdates.tv/')
# determine the content encoding from the response headers
content_encoding = f.headers.get('Content-Encoding')

# how to decompress gzip data with Python 3 (gzip.decompress is Python 3 only)
if content_encoding == 'gzip':
    response = gzip.decompress(f.read())

# how to decompress with Python 2 (urllib2 and cStringIO are Python 2 modules)
if content_encoding == 'gzip':
    gz = gzip.GzipFile(fileobj=StringIO(f.read()))
    response = gz.read()
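For completeness, a sketch of the same approach in Python 3 only (urllib2 and cStringIO no longer exist there); this is my addition, assuming the server honestly labels its gzip responses:
import gzip
import urllib.request

with urllib.request.urlopen('http://airdates.tv/') as f:
    body = f.read()
    # only decompress when the server declared gzip encoding
    if f.headers.get('Content-Encoding') == 'gzip':
        body = gzip.decompress(body)
print(body[:100])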
mhawke's solution (using requests instead of urllib) works perfectly and in most cases should be preferred.
That said, I was looking for a solution that does not require installing 3rd party libraries (hence my choice of urllib over requests).
I found a solution using standard libraries:
import zlib
from urllib.request import Request, urlopen

r = Request(r'http://airdates.tv/')
h = urlopen(r).read()
# a wbits value of 16 + MAX_WBITS tells zlib to expect a gzip header and trailer
decomp_gzip = zlib.decompress(h, 16 + zlib.MAX_WBITS)
print(decomp_gzip)
Which yields the following response:
b'<!DOCTYPE html>\n (continues...)'
I want to have a user input a file URL and then have my django app download the file from the internet.
My first instinct was to call wget inside my django app, but then I thought there may be another way to get this done. I couldn't find anything when I searched. Is there a more django way to do this?
You are not really dependent on Django for this.
I happen to like using the requests library.
Here is an example:
import requests
def download(url, path, chunk_size=2048):
    req = requests.get(url, stream=True)
    if req.status_code == 200:
        with open(path, 'wb') as f:
            # stream the body in chunks so large files never sit fully in memory
            for chunk in req.iter_content(chunk_size):
                f.write(chunk)
        return path
    raise Exception('Given url returned status code: {}'.format(req.status_code))
Place this in a file and import it into your module whenever you need it.
Of course this is very minimal, but it will get you started.
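To tie it back to Django, here is a hypothetical minimal view built on the download() helper above (the GET parameter name and the MEDIA_ROOT destination are my assumptions, not part of the answer):
import os
from django.conf import settings
from django.http import HttpResponse

def fetch(request):
    url = request.GET['url']  # hypothetical query parameter
    dest = os.path.join(settings.MEDIA_ROOT, os.path.basename(url))
    download(url, dest)  # the helper defined above
    return HttpResponse('Saved to {}'.format(dest))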
You can use urlopen from urllib2 like in this example:
import urllib2
pdf_file = urllib2.urlopen("http://www.example.com/files/some_file.pdf")
with open('test.pdf', 'wb') as output:
    output.write(pdf_file.read())
For more information, read the urllib2 docs.
I am trying to download a PDF file from a website and save it to disk. My attempts either fail with encoding errors or result in blank PDFs.
In [1]: import requests
In [2]: url = 'http://www.hrecos.org//images/Data/forweb/HRTVBSH.Metadata.pdf'
In [3]: response = requests.get(url)
In [4]: with open('/tmp/metadata.pdf', 'wb') as f:
   ...:     f.write(response.text)
---------------------------------------------------------------------------
UnicodeEncodeError Traceback (most recent call last)
<ipython-input-4-4be915a4f032> in <module>()
1 with open('/tmp/metadata.pdf', 'wb') as f:
----> 2 f.write(response.text)
3
UnicodeEncodeError: 'ascii' codec can't encode characters in position 11-14: ordinal not in range(128)
In [5]: import codecs
In [6]: with codecs.open('/tmp/metadata.pdf', 'wb', encoding='utf8') as f:
   ...:     f.write(response.text)
   ...:
I know it is a codec problem of some kind but I can't seem to get it to work.
You should use response.content in this case:
with open('/tmp/metadata.pdf', 'wb') as f:
    f.write(response.content)
From the documentation:
You can also access the response body as bytes, for non-text requests:
>>> r.content
b'[{"repository":{"open_issues":0,"url":"https://github.com/...
So that means: response.text returns the output as a str object; use it when you're downloading a text file, such as an HTML file.
And response.content returns the output as a bytes object; use it when you're downloading a binary file, such as a PDF, an audio file, or an image.
You can also use response.raw instead; however, use it when the file you're about to download is large. Below is a basic example which you can also find in the documentation:
import requests

chunk_size = 2000  # bytes per chunk; defined here so the example runs as-is
url = 'http://www.hrecos.org//images/Data/forweb/HRTVBSH.Metadata.pdf'
r = requests.get(url, stream=True)
with open('/tmp/metadata.pdf', 'wb') as fd:
    for chunk in r.iter_content(chunk_size):
        fd.write(chunk)
chunk_size is the chunk size you want to use. If you set it to 2000, requests will download the file 2000 bytes at a time, writing each chunk into the file until it finishes.
This can save your RAM. But I'd prefer to use response.content in this case, since your file is small. As you can see, using response.raw is more complex.
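For reference, a minimal sketch of the response.raw route mentioned above (shutil.copyfileobj streams the undecoded bytes straight to disk; this pairing is my addition, not part of the original answer):
import shutil
import requests

url = 'http://www.hrecos.org//images/Data/forweb/HRTVBSH.Metadata.pdf'
r = requests.get(url, stream=True)
with open('/tmp/metadata.pdf', 'wb') as fd:
    # copy the raw byte stream directly into the file
    shutil.copyfileobj(r.raw, fd)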
Related:
How to download large file in python with requests.py?
How to download image using requests
In Python 3, I find pathlib is the easiest way to do this. requests' response.content marries up nicely with pathlib's write_bytes.
from pathlib import Path
import requests
filename = Path('metadata.pdf')
url = 'http://www.hrecos.org//images/Data/forweb/HRTVBSH.Metadata.pdf'
response = requests.get(url)
filename.write_bytes(response.content)
You can use urllib:
import urllib.request
urllib.request.urlretrieve(url, "filename.pdf")
Please note I'm a beginner. If my solution is wrong, please feel free to correct it and/or let me know; I may learn something new too.
My solution:
Change downloadPath according to where you want your file to be saved. Feel free to use an absolute path too.
Save the below as downloadFile.py.
Usage: python downloadFile.py url-of-the-file-to-download new-file-name.extension
Remember to add an extension!
Example usage: python downloadFile.py http://www.google.co.uk google.html
import requests
import sys
import os
def downloadFile(url, fileName):
    with open(fileName, "wb") as file:
        response = requests.get(url)
        file.write(response.content)
scriptPath = sys.path[0]
downloadPath = os.path.join(scriptPath, '../Downloads/')
url = sys.argv[1]
fileName = sys.argv[2]
print('path of the script: ' + scriptPath)
print('downloading file to: ' + downloadPath)
downloadFile(url, downloadPath + fileName)
print('file downloaded...')
print('exiting program...')
Generally, this should work in Python 3:
import urllib.request
..
response = urllib.request.urlopen(url)
Remember that the Python 2 urllib and urllib2 modules were reorganized in Python 3; their functionality now lives in urllib.request and urllib.parse, and there is no urllib.request.get() function.
If in some mysterious cases requests doesn't work (it happened to me), you can also try using the third-party wget module:
import wget
wget.download(url)
Related:
Here's a decent explanation/solution to find and download all pdf files on a webpage:
https://medium.com/@dementorwriter/notesdownloader-use-web-scraping-to-download-all-pdfs-with-python-511ea9f55e48
Regarding Kevin's answer about writing into a tmp folder, it should be like this:
with open('./tmp/metadata.pdf', 'wb') as f:
    f.write(response.content)
They forgot the . before the path, and of course the tmp folder should have been created already.
What I am trying to do is use BeautifulSoup to download every zip file from the Google Patent archive. Below is the code that I've written thus far, but it seems that I am having trouble getting the files to download into a directory on my desktop. Any help would be greatly appreciated.
from bs4 import BeautifulSoup
import urllib2
import re
import pandas as pd
url = 'http://www.google.com/googlebooks/uspto-patents-grants.html'
site = urllib2.urlopen(url)
html = site.read()
soup = BeautifulSoup(html)
soup.prettify()
path = open('/Users/username/Desktop/', "wb")
for name in soup.findAll('a', href=True):
    print name['href']
    linkpath = name['href']
    rq = urllib2.request(linkpath)
    res = urllib2.urlopen(rq)
    path.write(res.read())
The result that I am supposed to get, is that all of the zip files are supposed to download into a specific dir. Instead, I am getting the following error:
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-13-874f34e07473> in <module>()
     17     print name['href']
     18     linkpath = name['href']
---> 19     rq = urllib2.request(namep)
     20     res = urllib2.urlopen(rq)
     21     path.write(res.read())

AttributeError: 'module' object has no attribute 'request'
In addition to using a non-existent request entity from urllib2, you don't output to a file correctly: you can't just open the directory, you have to open each file for output separately.
Also, the 'Requests' package has a much nicer interface than urllib2. I recommend installing it.
Note that, today anyway, the first .zip is 5.7Gb, so streaming to a file is essential.
Really, you want something more like this:
from BeautifulSoup import BeautifulSoup
import requests

# point to output directory
outpath = 'D:/patent_zips/'
url = 'http://www.google.com/googlebooks/uspto-patents-grants.html'
mbyte = 1024 * 1024

print 'Reading: ', url
html = requests.get(url).text
soup = BeautifulSoup(html)

print 'Processing: ', url
for name in soup.findAll('a', href=True):
    zipurl = name['href']
    if zipurl.endswith('.zip'):
        outfname = outpath + zipurl.split('/')[-1]
        r = requests.get(zipurl, stream=True)
        if r.status_code == requests.codes.ok:
            fsize = int(r.headers['content-length'])
            print 'Downloading %s (%sMb)' % (outfname, fsize / mbyte)
            with open(outfname, 'wb') as fd:
                for chunk in r.iter_content(chunk_size=1024):  # chunk size can be larger
                    if chunk:  # skip keep-alive chunks
                        fd.write(chunk)
This is your problem:
rq = urllib2.request(linkpath)
urllib2 is a module and it has no request entity/attribute in it.
I see a Request class in urllib2, but I'm unsure if that's what you intended to actually use...
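For what it's worth, a minimal sketch of that corrected call (the URL is hypothetical; note Request with a capital R, and urlopen() rather than the non-existent urllib2.request()):
import urllib2

linkpath = 'http://www.example.com/some_archive.zip'  # hypothetical URL
rq = urllib2.Request(linkpath)  # Request is a class, capital R
res = urllib2.urlopen(rq)       # urlopen() performs the actual download
with open('some_archive.zip', 'wb') as out:
    out.write(res.read())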