Unable to get page source code in python

I'm trying to get the source code of a page by using:
import urllib2

url = "http://france.meteofrance.com/france/meteo?PREVISIONS_PORTLET.path=previsionsville/750560"
page = urllib2.urlopen(url)
data = page.read()
print data
I also tried setting a user agent in the headers, but I did not succeed in getting the source code of the page!
Do you guys have any ideas about what can be done?
Thanks in advance

I tried it, and the request works, but the content you receive says (in French) that your browser must accept cookies. You could probably get around that with urllib2, but I think the easiest way would be to use the requests lib (if you don't mind having an additional dependency).
To install requests:
pip install requests
And then in your script:
import requests
url = 'http://france.meteofrance.com/france/meteo?PREVISIONS_PORTLET.path=previsionsville/750560'
response = requests.get(url)
print(response.content)
I'm pretty sure the source code of the page will be what you expect then.
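If a plain requests.get() still comes back with the cookie notice, a Session persists cookies between requests; this is a sketch, not part of the original answer:
import requests

url = 'http://france.meteofrance.com/france/meteo?PREVISIONS_PORTLET.path=previsionsville/750560'

# a Session stores cookies set by the server and sends them back
# on later requests, which satisfies the "accept cookies" check
session = requests.Session()
session.get(url)             # first request collects the cookies
response = session.get(url)  # second request sends them back
print(response.content)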

The requests library worked for me, as Martin Maillard showed.
Also, in another thread I noticed this note by leoluk:
Edit: It's 2014 now, and most of the important libraries have been
ported and you should definitely use Python 3 if you can.
python-requests is a very nice high-level library which is easier to
use than urllib2.
So I wrote this get_page procedure:
import requests

def get_page(website_url):
    response = requests.get(website_url)
    return response.content

print(get_page('http://example.com'))
Cheers!

I tried a lot of things, urllib, urllib2 and many others, but one thing worked for me for everything I needed and solved every problem I faced: Mechanize. This library simulates a real browser, so it handles a lot of issues in that area.
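For illustration, a minimal Mechanize sketch (the url and User-Agent string here are just placeholders) might look like this:
import mechanize  # pip install mechanize

br = mechanize.Browser()
br.set_handle_robots(False)  # ignore robots.txt, like a real browser session
br.addheaders = [('User-Agent', 'Mozilla/5.0')]  # placeholder User-Agent

response = br.open('http://example.com')  # placeholder url
print(response.read())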

Related

Downloading a torrent file using a GET request (.torrent)

I am trying to download a torrent file with this code:
import requests

url = "https://itorrents.org/torrent/0BB4C10F777A15409A351E58F6BF37E8FFF53CDB.torrent"
r = requests.get(url, allow_redirects=True)
with open('test123.torrent', 'wb') as f:
    f.write(r.content)
It downloads a torrent file, but when I load it into BitTorrent an error occurs.
It says: Unable to Load, Torrent Is Not Valid Bencoding.
Can anybody please help me resolve this problem? Thanks in advance
This page uses Cloudflare to prevent scraping. I am sorry to say that bypassing Cloudflare is very hard if you only use requests; the measures Cloudflare takes are updated frequently. The page checks whether your browser supports JavaScript, and if not, it won't give you the bytes of the file. That's why your code doesn't get the file. (You can use r.text to see the response content; it is an html page, not a file.)
Under these circumstances, I think you should consider using Selenium.
Bypassing Cloudflare can be a pain, so I suggest using a library that handles it. Keep in mind that your code may still break in the future, because Cloudflare changes its techniques periodically; if you use a library, you will just need to update the library (at least that's what you should hope for).
I have used a similar library, though only in NodeJS, but I see python also has something like that - cloudscraper
Example:
import cloudscraper
scraper = cloudscraper.create_scraper() # returns a CloudScraper instance
# Or: scraper = cloudscraper.CloudScraper() # CloudScraper inherits from requests.Session
print(scraper.get("http://somesite.com").text)  # => "<!DOCTYPE html><html><head>..."
Depending on your usage you may need to consider using proxies - CloudFlare can still block you if you send too many requests.
Also, if you are working with video torrents, you may be interested in Torrent Stream Server. It's a server that downloads and streams video at the same time, so you can watch the video without fully downloading it.
You can also get through by adding cookies to the headers. But cookies expire after some time, and then the only solution is to download the file from an open browser.
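For what it's worth, a rough sketch of the cookie approach (the cookie value is a placeholder you would copy from your browser's developer tools):
import requests

url = "https://itorrents.org/torrent/0BB4C10F777A15409A351E58F6BF37E8FFF53CDB.torrent"
# placeholder value -- copy the real cf_clearance cookie from your browser
cookies = {"cf_clearance": "PASTE_VALUE_FROM_BROWSER"}
# the User-Agent should match the browser the cookie came from
headers = {"User-Agent": "Mozilla/5.0"}

r = requests.get(url, headers=headers, cookies=cookies, allow_redirects=True)
with open("test123.torrent", "wb") as f:
    f.write(r.content)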

Where is the place to run HTTP requests?

I've been dabbling in the world of APIs and GET requests, but I'm not sure how to make plain ones. I know the basic Python and JavaScript ones, but I want to know how to make a straight-up request. The usual:
GET 'https://api.roblox.com'
or something like that.
Please help!
Making a GET request:
import requests # pip install requests
r = requests.get('https://api.roblox.com')
print(r.status_code, r.text)
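If by a "straight-up" request you mean issuing the GET yourself without a helper library, the built-in http.client module can do that (a sketch for Python 3):
import http.client

conn = http.client.HTTPSConnection("api.roblox.com")
conn.request("GET", "/")  # method and path, just like the raw request line
response = conn.getresponse()
print(response.status, response.reason)
print(response.read().decode())
conn.close()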

How to prevent 301 code from redirecting website?

I am trying to connect to websites with Python and get their HTTP status codes. As answers to this other question of mine suggest, the reason the HTTP status code for websites such as google.com is 301 (moved permanently) or 302 (found) is that these servers are redirecting. However, I would like to connect to them in such a way that I get the natural 200 (OK) from them. Here's my current code:
import httplib
conn = httplib.HTTPConnection("google.com", 80)
conn.request("GET","/")
r = conn.getresponse()
print r.status, r.reason
conn.close()
What do I need to alter/add to achieve this? I heard that pycurl library might help me with that, but googling hasn't brought any useful results so far. I am a novice in this field, so please excuse me if the question is trivial.
I assume what you want is for your code to follow the 301/302s to the final url, which returns a 200?
If so, you could try using urllib, or better still use requests, which you can install with pip.
Both urllib and, more reliably, requests should follow 301s and 302s and give you the final page that returns a 200.
Info on the requests module can be found here:
http://pypi.python.org/pypi/requests/
Hope this helps.
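A short sketch of what that looks like in practice (requests follows redirects by default):
import requests

r = requests.get('http://google.com')
print(r.status_code)  # 200, after the redirect chain has been followed
print(r.url)          # the final url the redirects led to
print(r.history)      # the intermediate 301/302 responses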

What is the deal with https when using lxml?

I am using lxml to parse html pages given their urls.
For example:
link = 'https://abc.com/def'
htmltree = lxml.html.parse(link)
My code works well for most cases, the ones with http://. However, I found that for every https:// url, lxml simply gets an IOError. Does anyone know the reason, and possibly how to correct the problem?
BTW, I want to stick with lxml rather than switch to BeautifulSoup, given I've already got a finished programme.
I don't know what's happening, but I get the same errors. HTTPS is probably not supported. You can easily work around this with urllib2, though:
from lxml import html
from urllib2 import urlopen
html.parse(urlopen('https://duckduckgo.com'))
From the lxml documentation:
lxml can parse from a local file, an HTTP URL or an FTP URL
I don't see HTTPS in that sentence anywhere, so I assume it is not supported.
An easy workaround would be to retrieve the file using some other library that does support HTTPS, such as urllib2, and pass the retrieved document as a string to lxml.
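For example, a sketch of that workaround with requests (urllib2 works the same way, as shown above):
import requests
from lxml import html

# fetch the page over https with requests, then hand the content to lxml
response = requests.get('https://duckduckgo.com')
tree = html.fromstring(response.content)
print(tree.findtext('.//title'))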

Python Web Crawlers and "getting" html source code

So my brother wanted me to write a web crawler in Python (self-taught) and I know C++, Java, and a bit of html. I'm using version 2.7 and reading the python library documentation, but I have a few problems. The httplib.HTTPConnection and request concepts are new to me, and I don't understand whether they download an html script, like a cookie, or an instance. If you do both of those, do you get the source for a website page? And what are some terms I would need to know to modify the page and return the modified page?
Just for background, I need to download a page and replace any img with ones I have.
And it would be nice if you guys could tell me your opinion of 2.7 and 3.1.
Use Python 2.7; it has more 3rd party libs at the moment. (Edit: see below.)
I recommend using the stdlib module urllib2; it will allow you to comfortably fetch web resources.
Example:
import urllib2
response = urllib2.urlopen("http://google.de")
page_source = response.read()
For parsing the code, have a look at BeautifulSoup.
BTW: what exactly do you want to do:
Just for background, I need to download a page and replace any img with ones I have
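If that is the goal, a minimal sketch with BeautifulSoup (the local image filename is a placeholder) might look like this:
import urllib2
from bs4 import BeautifulSoup  # pip install beautifulsoup4

html = urllib2.urlopen("http://example.com").read()
soup = BeautifulSoup(html, "html.parser")

# point every img at a placeholder image of your own
for img in soup.find_all("img"):
    img["src"] = "my_image.png"

print(soup.prettify())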
Edit: It's 2014 now, most of the important libraries have been ported, and you should definitely use Python 3 if you can. python-requests is a very nice high-level library which is easier to use than urllib2.
An example with Python 3 and the requests library, as mentioned by @leoluk:
pip install requests
Script req.py:
import requests

url = 'http://localhost'
# in case you need a session cookie
cd = {'sessionid': '123..'}
r = requests.get(url, cookies=cd)
# or without a session: r = requests.get(url)
print(r.content)
Now, execute it and you will get the html source of localhost!
python3 req.py
If you are using Python 3.x you don't need to install any libraries; this is built directly into the standard library. The old urllib2 package's functionality now lives in urllib.request:
from urllib import request
response = request.urlopen("https://www.google.com")
# set the correct charset below
page_source = response.read().decode('utf-8')
print(page_source)
The first thing you need to do is read the HTTP spec which will explain what you can expect to receive over the wire. The data returned inside the content will be the "rendered" web page, not the source. The source could be a JSP, a servlet, a CGI script, in short, just about anything, and you have no access to that. You only get the HTML that the server sent you. In the case of a static HTML page, then yes, you will be seeing the "source". But for anything else you see the generated HTML, not the source.
When you say "modify the page and return the modified page", what do you mean?
