Python Web Crawlers and "getting" html source code - python

So my brother wanted me to write a web crawler in Python (self-taught) and I know C++, Java, and a bit of html. I'm using version 2.7 and reading the python library, but I have a few problems
1. httplib.HTTPConnection and request concept to me is new and I don't understand if it downloads an html script like cookie or an instance. If you do both of those, do you get the source for a website page? And what are some words that I would need to know to modify the page and return the modified page.
Just for background, I need to download a page and replace any img with ones I have
And it would be nice if you guys could tell me your opinion of 2.7 and 3.1

Use Python 2.7, is has more 3rd party libs at the moment. (Edit: see below).
I recommend you using the stdlib module urllib2, it will allow you to comfortably get web resources.
Example:
import urllib2
response = urllib2.urlopen("http://google.de")
page_source = response.read()
For parsing the code, have a look at BeautifulSoup.
BTW: what exactly do you want to do:
Just for background, I need to download a page and replace any img with ones I have
Edit: It's 2014 now, most of the important libraries have been ported, and you should definitely use Python 3 if you can. python-requests is a very nice high-level library which is easier to use than urllib2.

An Example with python3 and the requests library as mentioned by #leoluk:
pip install requests
Script req.py:
import requests
url='http://localhost'
# in case you need a session
cd = { 'sessionid': '123..'}
r = requests.get(url, cookies=cd)
# or without a session: r = requests.get(url)
r.content
Now,execute it and you will get the html source of localhost!
python3 req.py

If you are using Python > 3.x you don't need to install any libraries, this is directly built in the python framework. The old urllib2 package has been renamed to urllib:
from urllib import request
response = request.urlopen("https://www.google.com")
# set the correct charset below
page_source = response.read().decode('utf-8')
print(page_source)

The first thing you need to do is read the HTTP spec which will explain what you can expect to receive over the wire. The data returned inside the content will be the "rendered" web page, not the source. The source could be a JSP, a servlet, a CGI script, in short, just about anything, and you have no access to that. You only get the HTML that the server sent you. In the case of a static HTML page, then yes, you will be seeing the "source". But for anything else you see the generated HTML, not the source.
When you say modify the page and return the modified page what do you mean?

Related

Downloading torrent file using get request (.torrent)

I am trying to download torrent file from this code :
url = "https://itorrents.org/torrent/0BB4C10F777A15409A351E58F6BF37E8FFF53CDB.torrent"
r = requests.get(url, allow_redirects=True)
open('test123.torrent', 'wb').write(r.content)
It downloads a torrent file , but when i load it to bittorrent error occurs.
It says Unable to Load , Torrent Is Not Valid Bencoding
Can anybody please help me to resolve this problem ? Thanks in advance
This page use cloudflare to prevent scraping the page,I am sorry to say that bypassing cloudflare is very hard if you only use requests, the measures cloudflare takes will update soon.This page will check your browser whether it support Javascript.If not, they won't give you the bytes of the file.That's why you couldn't use them.(You could use r.text to see the response content, it is a html page.Not a file.)
Under this circumstance, I think you should consider about using selenium.
Bypassing Cloudflare can be a pain, so I suggest using a library that handles it. Please don't forget that your code may break in the future because Cloudflare changes their techniques periodically. Well, if you use the library, you will just need to update the library (at least you should hope for that).
I used a similar library only in NodeJS, but I see python also has something like that - cloudscraper
Example:
import cloudscraper
scraper = cloudscraper.create_scraper() # returns a CloudScraper instance
# Or: scraper = cloudscraper.CloudScraper() # CloudScraper inherits from requests.Session
print scraper.get("http://somesite.com").text # => "<!DOCTYPE html><html><head>..."
Depending on your usage you may need to consider using proxies - CloudFlare can still block you if you send too many requests.
Also, if you are working with video torrents, you may be interested in Torrent Stream Server. It a server that downloads and streams video at the same time, so you can watch the video without fully downloading it.
We can do by adding cookies in headers .
But after some time cookie expires.
Therefore only solution is to download from opening browser

Extract HTML-Content from URL of Site that probably uses Cookies via Python

I recently wanted to extract data from a website that seems to use cookies to grant me access. I do not know very much about those procedures but appearently this inteferes with my method of getting the html content of the website via Python and its requests module.
The code I am running to extract the information contains the following lines:
import responses
#...
response = requests.get(url, proxies=proxies)
content = requests.text
Where the website i am referring to is http://ieeexplore.ieee.org/xpls/icp.jsp?arnumber=6675630&tag=1 and proxies is a valid dict of my proxy servers (I tested those settings on websites that seemed to work fine). However, instead of the content of the article on this site I receive the html-content of the page that you get when you do not accept cookies in your browser.
As I am not really aware of what website is really doing and lack real Web-Developement experience I could not find a solution so far, even if a similar question might have been asked before. Is there any solution to access the content of this website via Python?
startr = requests.get('https://viennaairport.com/login/')
secondr = requests.post('http://xxx/', cookies=startr.cookies)

How to get live stream url from script

I need to get the live stream url using a scripting language such as python or shell
eg: http://rt.com/on-air/
I can get the url by using a tool such as the network monitor on Firefox, but i need to be able to get it via a script
After quick look on requests documentation:
from contextlib import closing
with closing(requests.get('http://rt.com/on-air/', stream=True)) as r:
# Do things with the response here.
If it doesn't help, please check another way:
import requests
r = requests.get('http://rt.com/on-air/', stream=True)
for line in r.iter_lines():
# filter out keep-alive new lines
if line:
# do some sort of things
You need to identify it in the source of the page. It is pretty much the same as using the network tool from FF.
For python you can use beautifulsoup to parse the page and get more info out of it... or a simple regex.

Unable to get page source code in python

I'm trying to get the source code of a page by using:
import urllib2
url="http://france.meteofrance.com/france/meteo?PREVISIONS_PORTLET.path=previsionsville/750560"
page =urllib2.urlopen(url)
data=page.read()
print data
and also by using a user_agent(headers)
I did not succeed to get the source code of the page!
Have you guys any ideas what can be done?
Thanks in Advance
I tried it and the requests works, but the content that you receive says that your browser must accept cookies (in french). You could probably get around that with urllib2, but I think the easiest way would be to use the requests lib (if you don't mind having an additional dependency).
To install requests:
pip install requests
And then in your script:
import requests
url = 'http://france.meteofrance.com/france/meteo?PREVISIONS_PORTLET.path=previsionsville/750560'
response = requests.get(url)
print(response.content)
I'm pretty sure the source code of the page will be what you expect then.
requests library worked for me as Martin Maillard showed.
Also in another thread I have noticed this note by leoluk here:
Edit: It's 2014 now, and most of the important libraries have been
ported and you should definitely use Python 3 if you can.
python-requests is a very nice high-level library which is easier to
use than urllib2.
So I wrote this get_page procedure:
import requests
def get_page (website_url):
response = requests.get(website_url)
return response.content
print get_page('http://example.com')
Cheers!
I tried a lot of things, "urllib" "urllib2" and many other things, but one thing worked for me for everything I needed and solved any problem I faced. It was Mechanize .This library simulates using a real browser, so it handles a lot of issues in that area.

Fetch a Wikipedia article with Python

I try to fetch a Wikipedia article with Python's urllib:
f = urllib.urlopen("http://en.wikipedia.org/w/index.php?title=Albert_Einstein&printable=yes")
s = f.read()
f.close()
However instead of the html page I get the following response: Error - Wikimedia Foundation:
Request: GET http://en.wikipedia.org/w/index.php?title=Albert_Einstein&printable=yes, from 192.35.17.11 via knsq1.knams.wikimedia.org (squid/2.6.STABLE21) to ()
Error: ERR_ACCESS_DENIED, errno [No Error] at Tue, 23 Sep 2008 09:09:08 GMT
Wikipedia seems to block request which are not from a standard browser.
Anybody know how to work around this?
You need to use the urllib2 that superseedes urllib in the python std library in order to change the user agent.
Straight from the examples
import urllib2
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
infile = opener.open('http://en.wikipedia.org/w/index.php?title=Albert_Einstein&printable=yes')
page = infile.read()
It is not a solution to the specific problem. But it might be intersting for you to use the mwclient library (http://botwiki.sno.cc/wiki/Python:Mwclient) instead. That would be so much easier. Especially since you will directly get the article contents which removes the need for you to parse the html.
I have used it myself for two projects, and it works very well.
Rather than trying to trick Wikipedia, you should consider using their High-Level API.
In case you are trying to access Wikipedia content (and don't need any specific information about the page itself), instead of using the api you should just call index.php with 'action=raw' in order to get the wikitext, like in:
'http://en.wikipedia.org/w/index.php?action=raw&title=Main_Page'
Or, if you want the HTML code, use 'action=render' like in:
'http://en.wikipedia.org/w/index.php?action=render&title=Main_Page'
You can also define a section to get just part of the content with something like 'section=3'.
You could then access it using the urllib2 module (as sugested in the chosen answer).
However, if you need information about the page itself (such as revisions), you'll be better using the mwclient as sugested above.
Refer to MediaWiki's FAQ if you need more information.
The general solution I use for any site is to access the page using Firefox and, using an extension such as Firebug, record all details of the HTTP request including any cookies.
In your program (in this case in Python) you should try to send a HTTP request as similar as necessary to the one that worked from Firefox. This often includes setting the User-Agent, Referer and Cookie fields, but there may be others.
requests is awesome!
Here is how you can get the html content with requests:
import requests
html = requests.get('http://en.wikipedia.org/w/index.php?title=Albert_Einstein&printable=yes').text
Done!
Try changing the user agent header you are sending in your request to something like:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008072820 Ubuntu/8.04 (hardy) Firefox/3.0.1 (Linux Mint)
You don't need to impersonate a browser user-agent; any user-agent at all will work, just not a blank one.
Requesting the page with ?printable=yes gives you an entire relatively clean HTML document. ?action=render gives you just the body HTML. Requesting to parse the page through the MediaWiki action API with action=parse likewise gives you just the body HTML but would be good if you want finer control, see parse API help.
If you just want the page HTML so you can render it, it's faster and better is to use the new RESTBase API, which returns a cached HTML representation of the page. In this case, https://en.wikipedia.org/api/rest_v1/page/html/Albert_Einstein.
As of November 2015, you don't have to set your user-agent, but it's strongly encouraged. Also, nearly all Wikimedia wikis require HTTPS, so avoid a 301 redirect and make https requests.
import urllib
s = urllib.urlopen('http://en.wikipedia.org/w/index.php?action=raw&title=Albert_Einstein').read()
This seems to work for me without changing the user agent. Without the "action=raw" it does not work for me.

Categories

Resources