Get source HTML from the local system in Python - python

I want to get a page's source, but from the local file system rather than from the internet, for example:
url = urllib.request.urlopen('c://1.html')
This is what I do for a page on the internet:
>>> import urllib.request
>>> url=urllib.request.urlopen ('http://google.com')
>>> page =url.read()
>>> page=page.decode()
>>> page
What is wrong with my approach?

from os.path import abspath

with open(abspath('c:/1.html')) as fh:
    print(fh.read())
Since url.read() just gives you the data as-is, and .decode() does nothing more than convert the byte data from the socket into an ordinary string, you can simply print the file contents.
urllib is mainly (if not only) a transport for receiving HTML data; it does not actually parse the content. All it does is connect to the source, separate the headers, and give you the content. If you have already stored the page locally in a file, urllib has no further use to you. Consider an HTML parsing library such as BeautifulSoup instead.
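That said, if you want to keep the familiar urllib interface for a local file, urlopen() also accepts file:// URLs. A minimal sketch (the temporary file here is just a stand-in for c:/1.html):

```python
import tempfile
from pathlib import Path
from urllib.request import urlopen

# Write a tiny page to a temporary file (a stand-in for c:/1.html).
tmp = Path(tempfile.mkdtemp()) / "1.html"
tmp.write_text("<html><body>hello</body></html>")

# urlopen() also accepts file:// URLs, so the familiar interface works locally.
with urlopen(tmp.as_uri()) as url:
    page = url.read().decode()

print(page)
```

On Windows, `Path('c:/1.html').as_uri()` produces the properly formed `file:///c:/1.html` URL for you.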

Related

Getting a specific file from requested iframe

I want to get the file link for the anime I'm watching from the site.
import requests
from bs4 import BeautifulSoup
import re

page = requests.get("http://naruto-tube.org/shippuuden-sub-219")
soup = BeautifulSoup(page.content, "html.parser")
inner_content = requests.get(soup.find("iframe")["src"])
print(inner_content.text)
The output is the source code of the file hoster's website (ani-stream). My problem now is: how do I get just the "file: xxxxxxx" line printed, rather than the whole source code?
You can use Beautiful Soup to parse the iframe source code and find the script elements, but from there you're on your own. The file: "xxxxx", line is in JavaScript code, so you'll have to find the function call (to playerInstance.setup() in this case), decide which of the two such "file:" lines is the one you want, and strip away the unwanted JS syntax around the URL.
Regular expressions will help with that, and you're probably better off just looking for the lines in the iframe's HTML. You already have re imported, so I just replaced your last line with:
lines = re.findall("file: .*$", inner_content.text, re.MULTILINE)
print('\n'.join(lines))
...to get a list of the lines containing "file:". You can (and should) use a fancier RE that finds just the one with "http://" and allows only whitespace before "file:" on the line. (Python, Java and my text editor all have different ideas about what's in an RE, so I have to go to the docs every time I write one. You can do that too--it's your problem, after all.)
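Such a fancier RE might look like the sketch below. The JavaScript snippet is hypothetical, just to illustrate the shape of the player setup code being matched:

```python
import re

# Hypothetical snippet of the iframe's JavaScript, for illustration only.
js = '''
playerInstance.setup({
    file: "http://example.com/video.mp4",
    image: "thumb.jpg",
});
file: "not-a-url"
'''

# Match only lines where "file:" is followed by a quoted http(s) URL,
# allowing nothing but whitespace before "file:" on the line.
urls = re.findall(r'^\s*file:\s*"(https?://[^"]+)"', js, re.MULTILINE)
print(urls)  # ['http://example.com/video.mp4']
```

The capture group returns just the URL, so there is no JS syntax left to strip away.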
The requests.get() function doesn't seem to work to get the bytes. Try Vishnu Kiran's urlretrieve approach--maybe that will work. Using the URL in a browser window does seem to get the right video, though, so there may be a user agent and/or cookie setting that you'll have to spoof.
If the iframe's source is not on the primary domain of the website (naruto-tube.org), its contents cannot be accessed by scraping the original page.
You will have to use a different website, or get the URL from the iframe and use a library like requests to call that URL directly.
Note that you must also pass any parameters the URL requires to actually get a result, like so:
import urllib
urllib.urlretrieve ("url from the Iframe", "mp4.mp4")

Retrieving full URL from cgi.FieldStorage

I'm passing a URL to a python script using cgi.FieldStorage():
http://localhost/cgi-bin/test.py?file=http://localhost/test.xml
test.py just contains
#!/usr/bin/env python
import cgi
print "Access-Control-Allow-Origin: *"
print "Content-Type: text/plain; charset=x-user-defined"
print "Accept-Ranges: bytes"
print
print cgi.FieldStorage()
and the result is
FieldStorage(None, None, [MiniFieldStorage('file', 'http:/localhost/test.xml')])
Note that the URL only contains http:/localhost - how do I pass the full encoded URI so that file is the whole URI? I've tried encoding the file parameter (http%3A%2F%2Flocalhost%2ftext.xml) but this also doesn't work
The screenshot shows that the output on the webpage isn't what is expected, even though the encoded URL is correct.
Your CGI script works fine for me using Apache 2.4.10 and Firefox (curl also). What web server and browser are you using?
My guess is that you are using Python's CGIHTTPServer, or something based on it. This exhibits the problem that you identify. CGIHTTPServer assumes that it is being provided with a path to a CGI script, so it collapses the path without regard to any query string that might be present. Collapsing the path removes duplicate forward slashes as well as relative path elements such as '..'.
If you are using this web server, I don't see any obvious way around the problem by changing the URL. You won't be using it in production, so perhaps look at another web server such as Apache, nginx, lighttpd, etc.
The problem is with your query parameters, you should be encoding them:
>>> from urllib import urlencode
>>> urlencode({'file': 'http://localhost/test.xml', 'other': 'this/has/forward/slashes'})
'other=this%2Fhas%2Fforward%2Fslashes&file=http%3A%2F%2Flocalhost%2Ftest.xml'
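The same round trip works in Python 3 with urllib.parse, which also provides parse_qs to show how the CGI side recovers the full URL from the encoded query string:

```python
from urllib.parse import urlencode, parse_qs

# Percent-encode the query parameters so the embedded URL survives intact.
params = urlencode({'file': 'http://localhost/test.xml'})
print(params)  # file=http%3A%2F%2Flocalhost%2Ftest.xml

# The CGI side decodes the query string back to the full URL.
decoded = parse_qs(params)
print(decoded['file'][0])  # http://localhost/test.xml
```

Because every '/' and ':' in the parameter value is escaped, nothing in the path-handling layer can collapse them.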

Downloading files from an http server in python

Using urllib2, we can get the http response from a web server. If that server simply holds a list of files, we could parse through the files and download each individually. However, I'm not sure what the easiest, most pythonic way to parse through the files would be.
When you get a whole http response of the generic file server list, through urllib2's urlopen() method, how can we neatly download each file?
Urllib2 might be OK to retrieve the list of files. For downloading large numbers of binary files, PycURL http://pycurl.sourceforge.net/ is a better choice. This works for my IIS-based file server:
import re
import urllib2
import pycurl

url = "http://server.domain/"
path = "path/"
# Link pattern in an IIS directory listing page.
pattern = '<A HREF="/%s.*?">(.*?)</A>' % path
response = urllib2.urlopen(url + path).read()

for filename in re.findall(pattern, response):
    with open(filename, "wb") as fp:
        curl = pycurl.Curl()
        curl.setopt(pycurl.URL, url + path + filename)
        curl.setopt(pycurl.WRITEDATA, fp)
        curl.perform()
        curl.close()
You can use urllib.urlretrieve (in Python 3.x: urllib.request.urlretrieve):
import urllib
urllib.urlretrieve('http://site.com/', filename='filez.txt')
This should work :)
and this is a function that does the same thing (using urllib):
def download(url):
    webFile = urllib.urlopen(url)
    localFile = open(url.split('/')[-1], 'w')
    localFile.write(webFile.read())
    webFile.close()
    localFile.close()
Can you guarantee that the URL you're requesting is a directory listing? If so, can you guarantee the format of the directory listing?
If so, you could use lxml to parse the returned document and find all of the elements that hold the path to a file, then iterate over those elements and download each file.
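The answer above suggests lxml; as a dependency-free sketch of the same idea, the standard library's html.parser can pull the href attributes out of a listing page. The listing HTML here is hypothetical:

```python
from html.parser import HTMLParser

# Hypothetical directory-listing HTML, for illustration only.
listing = ('<html><body>'
           '<a href="file1.txt">file1.txt</a> '
           '<a href="file2.bin">file2.bin</a>'
           '</body></html>')

class LinkCollector(HTMLParser):
    """Collect every href attribute from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.links.extend(v for k, v in attrs if k == 'href')

parser = LinkCollector()
parser.feed(listing)
print(parser.links)  # ['file1.txt', 'file2.bin']
```

Each collected link can then be joined to the base URL and downloaded in a loop.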
1. Download the index file.
   If it's really huge, it may be worth reading a chunk at a time; otherwise it's probably easier to just grab the whole thing into memory.
2. Extract the list of files to get.
   If the list is XML or HTML, use a proper parser; else if there is much string processing to do, use regexes; else use simple string methods.
   Again, you can parse it all at once or incrementally. Incrementally is somewhat more efficient and elegant, but unless you are processing multiple tens of thousands of lines it's probably not critical.
3. For each file, download it and save it to a file.
   If you want to try to speed things up, you could try running multiple download threads; another (significantly faster) approach might be to delegate the work to a dedicated downloader program like Aria2 http://aria2.sourceforge.net/ - note that Aria2 can be run as a service and controlled via XMLRPC, see http://sourceforge.net/apps/trac/aria2/wiki/XmlrpcInterface#InteractWitharia2UsingPython
My suggestion would be to use BeautifulSoup (which is an HTML/XML parser) to parse the page for a list of files. Then, pycURL would definitely come in handy.
Another method, after you've got the list of files, is to use urllib.urlretrieve in a way similar to wget in order to simply download the file to a location on your filesystem.
This is a non-conventional way, but it works:
fPointer = open(picName, 'wb')
self.curl.setopt(self.curl.WRITEFUNCTION, fPointer.write)
The correct way is:
urllib.urlretrieve(link, picName)
Here's an untested solution:
import urllib2

response = urllib2.urlopen('http://server.com/file.txt')
urls = response.read().replace('\r', '').split('\n')
for file in urls:
    print 'Downloading ' + file
    response = urllib2.urlopen(file)
    handle = open(file, 'w')
    handle.write(response.read())
    handle.close()
It's untested, and it probably won't work. This is assuming you have an actual list of files inside of another file. Good luck!

In Python, how do I decode GZIP encoding?

I downloaded a webpage in my python script.
In most cases, this works fine.
However, this one had a response header indicating GZIP encoding, and when I tried to print the source code of this web page, I got garbage symbols in my PuTTY terminal.
How do I decode this to regular text?
I use zlib to decompress gzipped content from the web.
import zlib
import urllib.request

f = urllib.request.urlopen(url)
decompressed_data = zlib.decompress(f.read(), 16 + zlib.MAX_WBITS)
Decompress your byte stream using the built-in gzip module.
If you have any problems, do show the exact minimal code that you used, the exact error message and traceback, together with the result of print repr(your_byte_stream[:100])
Further information
1. For an explanation of the gzip/zlib/deflate confusion, read the "Other uses" section of this Wikipedia article.
2. It can be easier to use the zlib module than the gzip module if you have a string rather than a file. Unfortunately the Python docs are incomplete/wrong:
zlib.decompress(string[, wbits[, bufsize]])
...The absolute value of wbits is the base two logarithm of the size of the history buffer (the “window size”) used when compressing data. Its absolute value should be between 8 and 15 for the most recent versions of the zlib library, larger values resulting in better compression at the expense of greater memory usage. The default value is 15. When wbits is negative, the standard gzip header is suppressed; this is an undocumented feature of the zlib library, used for compatibility with unzip's compression file format.
Firstly, 8 <= log2_window_size <= 15, with the meaning given above. Then what should be a separate arg is kludged on top:
arg == log2_window_size means assume string is in zlib format (RFC 1950; what the HTTP 1.1 RFC 2616 confusingly calls "deflate").
arg == -log2_window_size means assume string is in deflate format (RFC 1951; what people who didn't read the HTTP 1.1 RFC carefully actually implemented)
arg == 16 + log_2_window_size means assume string is in gzip format (RFC 1952). So you can use 31.
The above information is documented in the zlib C library manual ... Ctrl-F search for windowBits.
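The three wbits conventions described above can be demonstrated side by side with the standard library. This is a minimal sketch using gzip.compress and zlib.compress to produce each wrapping:

```python
import gzip
import zlib

payload = b"hello world" * 10

# gzip format (RFC 1952): wbits = 16 + MAX_WBITS, i.e. 31.
assert zlib.decompress(gzip.compress(payload), 16 + zlib.MAX_WBITS) == payload

# zlib format (RFC 1950): positive wbits up to MAX_WBITS (15).
assert zlib.decompress(zlib.compress(payload), zlib.MAX_WBITS) == payload

# Raw deflate (RFC 1951): negative wbits suppresses header handling.
co = zlib.compressobj(wbits=-zlib.MAX_WBITS)
raw = co.compress(payload) + co.flush()
assert zlib.decompress(raw, -zlib.MAX_WBITS) == payload

# 32 + MAX_WBITS (i.e. 47) auto-detects zlib or gzip wrapping.
assert zlib.decompress(gzip.compress(payload), 32 + zlib.MAX_WBITS) == payload
```

The auto-detecting value (32 + MAX_WBITS) is handy when a server might send either Content-Encoding.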
For Python 3
Try out this:
import gzip
fetch = opener.open(request) # basically get a response object
data = gzip.decompress(fetch.read())
data = str(data,'utf-8')
I use something like this:
f = urllib2.urlopen(request)
data = f.read()
try:
    from cStringIO import StringIO
    from gzip import GzipFile
    data2 = GzipFile('', 'r', 0, StringIO(data)).read()
    data = data2
except:
    # print "decompress error %s" % err
    pass
return data
If you use the Requests module, then you don't need to use any other modules because the gzip and deflate transfer-encodings are automatically decoded for you.
Example:
>>> import requests
>>> custom_header = {'Accept-Encoding': 'gzip'}
>>> response = requests.get('https://api.github.com/events', headers=custom_header)
>>> response.headers
{'Content-Encoding': 'gzip',...}
>>> response.text
'[{"id":"9134429130","type":"IssuesEvent","actor":{"id":3287933,...
The .text property of the response is for reading the content in the text context.
The .content property of the response is for reading the content in the binary context.
See the Binary Response Content section on docs.python-requests.org
Similar to Shatu's answer for Python 3, but arranged a little differently:
import gzip
from json import loads as json_load
from urllib.request import Request, urlopen

s = Request("https://someplace.com", None, headers)
r = urlopen(s, None, 180).read()
try: r = gzip.decompress(r)
except OSError: pass
result = json_load(r.decode())
This method allows wrapping gzip.decompress() in a try/except to catch and pass over the OSError that results when you get mixed compressed and uncompressed data. Some small strings actually get bigger if they are encoded, so the plain data is sent instead.
This version is simple and avoids reading the whole file first by not calling the read() method. Instead it provides a file-stream-like object that behaves just like a normal file stream.
import gzip
from urllib.request import urlopen
my_gzip_url = 'http://my_url.gz'
my_gzip_stream = urlopen(my_gzip_url)
my_stream = gzip.open(my_gzip_stream, 'r')
None of these answers worked out of the box using Python 3. Here is what worked for me to fetch a page and decode the gzipped response:
import requests
import gzip
response = requests.get('your-url-here')
data = str(gzip.decompress(response.content), 'utf-8')
print(data) # decoded contents of page
You can use urllib3 to easily decode gzip.
urllib3.response.decode_gzip(response.data)

Python: Downloading a large file to a local path and setting custom http headers

I am looking to download a file from a http url to a local file. The file is large enough that I want to download it and save it chunks rather than read() and write() the whole file as a single giant string.
The interface of urllib.urlretrieve is essentially what I want. However, I cannot see a way to set request headers when downloading via urllib.urlretrieve, which is something I need to do.
If I use urllib2, I can set request headers via its Request object. However, I don't see an API in urllib2 to download a file directly to a path on disk like urlretrieve. It seems that instead I will have to use a loop to iterate over the returned data in chunks, writing them to a file myself and checking when we are done.
What would be the best way to build a function that works like urllib.urlretrieve but allows request headers to be passed in?
What is the harm in writing your own function using urllib2?
import os
import sys
import urllib2

def urlretrieve(urlfile, fpath):
    chunk = 4096
    f = open(fpath, "w")
    while 1:
        data = urlfile.read(chunk)
        if not data:
            print "done."
            break
        f.write(data)
        print "Read %s bytes" % len(data)
    f.close()
and using a Request object to set headers:
request = urllib2.Request("http://www.google.com")
request.add_header('User-agent', 'Chrome XXX')
urlretrieve(urllib2.urlopen(request), "/tmp/del.html")
If you want to use urllib and urlretrieve, subclass urllib.URLopener and use its addheader() method to adjust the headers (e.g. addheader('Accept', 'sound/basic'), which I'm pulling from the docstring for urllib.addheader).
To install your URLopener for use by urllib, see the example in the urllib._urlopener section of the docs (note the underscore):
import urllib

class MyURLopener(urllib.URLopener):
    pass  # your override here, perhaps to __init__

urllib._urlopener = MyURLopener
However, you'll be pleased to hear, with regard to your comment on the question, that reading an empty string from read() is indeed the signal to stop. This is how urlretrieve decides when to stop, for example. TCP/IP and sockets abstract the reading process: the call blocks waiting for additional data until the connection on the other end reaches EOF and closes, at which point read()ing from the connection returns an empty string. An empty string means no more data is coming; you don't have to worry about ordered packet reassembly, as that has all been handled for you. If that's your concern about urllib2, I think you can safely use it.
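In Python 3, the same pattern can be written with urllib.request and shutil.copyfileobj, which does the chunked read/write loop for you. This is a sketch, not a standard API; the function name urlretrieve_with_headers is made up for illustration, and the demo uses a local file:// URL so it is self-contained:

```python
import shutil
import tempfile
from pathlib import Path
from urllib.request import Request, urlopen

def urlretrieve_with_headers(url, fpath, headers=None):
    """Download url to fpath in fixed-size chunks, sending custom request headers."""
    req = Request(url, headers=headers or {})
    with urlopen(req) as resp, open(fpath, "wb") as f:
        # copyfileobj loops over read(length) until it returns b"",
        # so the whole body is never held in memory at once.
        shutil.copyfileobj(resp, f, length=4096)

# Demo against a local file:// URL so the sketch runs without a network.
src = Path(tempfile.mkdtemp()) / "src.bin"
src.write_bytes(b"x" * 10000)
dst = src.with_name("dst.bin")
urlretrieve_with_headers(src.as_uri(), dst, headers={"User-Agent": "Chrome XXX"})
print(dst.stat().st_size)  # 10000
```

The headers are carried on the Request object, so the download loop itself stays identical whether or not you customize them.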
