How do I get the HTML of a website using Python 3?

I've been trying to do this with repl.it and have tried several solutions on this site, but none of them work. Right now, my code looks like
import urllib
url = "http://www.pythonchallenge.com/pc/def/linkedlist.php?nothing=12345"
print (urllib.urlopen(url).read())
but it just says "AttributeError: module 'urllib' has no attribute 'urlopen'".
If I add import urllib.urlopen, it tells me there's no module named that. How can I fix my problem?

The syntax you are using for the urllib library is from Python v2. The library has changed somewhat for Python v3. The new notation would look something more like:
import urllib.request
response = urllib.request.urlopen("http://www.google.com")
html = response.read()
The html object is a bytes object containing the returned HTML of the site; call html.decode('utf-8') (or whichever encoding the site uses) to get a str. Much like the original urllib library, you should not expect images or other data files to be included in this returned object.
The confusing part here is that, in Python 3, this would fail if you did:
import urllib
response = urllib.request.urlopen("http://www.google.com")
html = response.read()
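This module-importing behavior is, I am told, intended: import urllib loads only the urllib package, not its submodules, so urllib.request is not available as an attribute until something imports urllib.request explicitly. BUT it is non-intuitive and awkward. More importantly, for you, it makes the situation harder to debug. Enjoy.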
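Note that response.read() returns bytes in Python 3. Here is a minimal sketch of a helper that fetches a page and decodes it to a str. The name fetch_html and the UTF-8 fallback are assumptions made for illustration, not part of the answer above. The demo uses a data: URL so that it runs without network access; a real http:// or https:// URL is passed in the same way.

```python
import urllib.request
from urllib.parse import quote


def fetch_html(url, fallback_encoding="utf-8"):
    """Return the body at `url` as a str (hypothetical helper)."""
    with urllib.request.urlopen(url) as response:
        raw = response.read()  # bytes in Python 3, not str
        # Use the charset declared in the Content-Type header when present.
        charset = response.headers.get_content_charset() or fallback_encoding
    return raw.decode(charset, errors="replace")


# Offline demo: urlopen also understands data: URLs.
demo_url = "data:text/html;charset=utf-8," + quote("<h1>Hello</h1>")
print(fetch_html(demo_url))
```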

Python 3, using the third-party requests package:
import requests
url = "http://www.pythonchallenge.com/pc/def/linkedlist.php?nothing=12345"
r = requests.get(url).text
print(r)
or, using only the standard library:
import urllib.request
url = "http://www.pythonchallenge.com/pc/def/linkedlist.php?nothing=12345"
r = urllib.request.urlopen(url).read()
print(r)

Related

how can I get data properly?

Hi, I am new to Python and use Python 3 on a Mac; I don't know if this is relevant. Now to the question: I need to get data from an API for school, but instead of the data I get this output:
<module 'requests' from '/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/requests/__init__.py'>
Can somebody explain what this means? My code:
import requests
requests.get('https://api.github.com')
print(requests)

You are printing the module requests instead of the response of your request.
Try this one:
import requests
res = requests.get('https://api.github.com')
print(res.content)

Convert urllib2 python code to use urllib module

I have the following code below which runs using the urllib2 module, but I have a requirement to upgrade to Python 3.x and this prevents the use of urllib2. I am aware it is split across urllib.request and urllib.error, but I am struggling to convert the following code to use the urllib module instead after reading through the doc and other relevant questions. Any help is greatly appreciated.
opener = urllib2.build_opener(urllib2.HTTPHandler)
request = urllib2.Request(url=event['ResponseURL'], data=data)
request.add_header('Content-Type', '')
request.get_method = lambda: 'PUT'
url = opener.open(request)

All you need to do is replace urllib2 with urllib.request. You are not using anything that has moved to other urllib.* modules:
import urllib.request
opener = urllib.request.build_opener(urllib.request.HTTPHandler)
request = urllib.request.Request(url=event['ResponseURL'], data=data)
request.add_header('Content-Type', '')
request.get_method = lambda: 'PUT'
url = opener.open(request)
You can always run the 2to3 command-line tool on your Python 2 code and see what changes it makes; the default action is to output changes on stdout in unified diff format.
The urllib fixer will then also add imports for urllib.error and urllib.parse at the top, because it knows that code that imported urllib2 could need any of the 3 urllib.* modules; it isn't smart enough to limit the import only to those that are actually needed after transforming the rest of the urllib2 references in the module.
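As an aside, since Python 3.3 urllib.request.Request accepts a method argument, which avoids overriding get_method with a lambda. A minimal sketch follows. The URL and payload are placeholders invented for this example (the original code takes them from event['ResponseURL'] and data), and the request is only built here, not sent.

```python
import urllib.request

# Placeholder values, assumed for illustration only.
response_url = "https://example.com/response"
data = b"example payload"

request = urllib.request.Request(
    url=response_url,
    data=data,  # must be bytes in Python 3
    headers={"Content-Type": ""},
    method="PUT",  # replaces request.get_method = lambda: 'PUT'
)

print(request.get_method())  # PUT

# To actually send it:
# response = urllib.request.urlopen(request)
```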

How to scrape data from JSON/Javascript of web page?

I'm new to Python and just got started with it today.
My system environment is Python 3.5 with some libraries on Windows 10.
I want to extract football player data from the site below as a CSV file.
Problem: I cannot extract the data from soup.find_all('script')[17] into my expected CSV format. How can I extract the data the way I want?
My code is shown below.
from bs4 import BeautifulSoup
import re
from urllib.request import Request, urlopen
req = Request('http://www.futhead.com/squad-building-challenges/squads/343', headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()
soup = BeautifulSoup(webpage,'html.parser') #not sure if i need to use lxml
soup.find_all('script')[17] #My target data is in 17th
My expected output would be similar to this
position,slot_position,slug
ST,ST,paulo-henrique
LM,LM,mugdat-celik

As @josiah Swain said, it's not going to be pretty. For this sort of thing it's generally recommended to use JS, as it can understand what you have.
That said, Python is awesome and here is your solution!
#Same imports as before
from bs4 import BeautifulSoup
import re
from urllib.request import Request, urlopen
#And one more
import json
# The code you had
req = Request('http://www.futhead.com/squad-building-challenges/squads/343',
              headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()
soup = BeautifulSoup(webpage,'html.parser')
# Store the script
script = soup.find_all('script')[17]
# Extract the one line that stores all that JSON
uncleanJson = [line for line in script.text.split('\n')
               if line.lstrip().startswith('squad.register_players($.parseJSON')][0]
# The easiest way to strip away all that yucky JS to get to the JSON
cleanJSON = uncleanJson.lstrip() \
    .replace('squad.register_players($.parseJSON(\'', '') \
    .replace('\'));','')
# Extract out that useful info
data = [[p['position'], p['data']['slot_position'], p['data']['slug']]
        for p in json.loads(cleanJSON)
        if p['player'] is not None]
print('position,slot_position,slug')
for line in data:
    print(','.join(line))
The result I get for copying and pasting this into python is:
position,slot_position,slug
ST,ST,paulo-henrique
LM,LM,mugdat-celik
CAM,CAM,soner-aydogdu
RM,RM,petar-grbic
GK,GK,fatih-ozturk
CDM,CDM,eray-ataseven
LB,LB,kadir-keles
CB,CB,caner-osmanpasa
CB,CB,mustafa-yumlu
RM,RM,ioan-adrian-hora
GK,GK,bora-kork
Edit: On reflection, this is not the easiest code to read for a beginner. Here is an easier-to-read version:
# ... All that previous code
script = soup.find_all('script')[17]
allScriptLines = script.text.split('\n')
uncleanJson = None
for line in allScriptLines:
    # Remove left whitespace (makes it easier to parse)
    cleaner_line = line.lstrip()
    if cleaner_line.startswith('squad.register_players($.parseJSON'):
        uncleanJson = cleaner_line
cleanJSON = uncleanJson.replace('squad.register_players($.parseJSON(\'', '').replace('\'));','')
print('position,slot_position,slug')
for player in json.loads(cleanJSON):
    if player['player'] is not None:
        # sep=',' keeps the output comma-separated, matching the header line
        print(player['position'], player['data']['slot_position'], player['data']['slug'], sep=',')

So my understanding is that BeautifulSoup is good at parsing HTML, but you are trying to parse JavaScript nested in the HTML.
So you have two options:
1. Simply create a function that takes the result of soup.find_all('script')[17], searches the string manually for the data, and extracts it. You can even use ast.literal_eval(string_thats_really_a_dictionary) to make it easier, although it will fail on JSON values such as null, true, or false. This may not be the best approach, but if you are new to Python you might want to do it this way just for practice.
2. Use the json library to parse the extracted string. This is probably the better way to do it.
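The approach described in the answers above (locate the squad.register_players call, strip the JavaScript wrapper, parse the JSON, write CSV) can be sketched with the standard library only. The sample script text below is a made-up stand-in that imitates the shape of the page's data, and players_to_csv is a hypothetical helper name. The real page may differ, for example if the JavaScript string contains escaped quotes.

```python
import csv
import io
import json
import re

# Hypothetical stand-in for the text of the target <script> tag.
SAMPLE_SCRIPT = """
// other JavaScript on the page
    squad.register_players($.parseJSON('[{"position": "ST", "player": 1, "data": {"slot_position": "ST", "slug": "paulo-henrique"}}, {"position": "LM", "player": null, "data": {"slot_position": "LM", "slug": "empty-slot"}}]'));
"""

# Capture everything between parseJSON(' and '));
PATTERN = re.compile(r"squad\.register_players\(\$\.parseJSON\('(.*)'\)\);")


def players_to_csv(script_text):
    """Return the registered players as CSV text, skipping empty slots."""
    match = PATTERN.search(script_text)
    if match is None:
        raise ValueError("no squad.register_players(...) call found")
    players = json.loads(match.group(1))
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["position", "slot_position", "slug"])
    for p in players:
        if p["player"] is not None:
            writer.writerow([p["position"], p["data"]["slot_position"], p["data"]["slug"]])
    return buffer.getvalue()


print(players_to_csv(SAMPLE_SCRIPT), end="")
```

With BeautifulSoup, script.text from the selected tag would be passed in place of SAMPLE_SCRIPT. Using the csv module rather than ','.join also handles values that themselves contain commas.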

Download file as string in python

I want to download a file to python as a string. I have tried the following, but it doesn't seem to work. What am I doing wrong, or what else might I do?
from urllib import request
webFile = request.urlopen(url).read()
print(webFile)

The following example works.
from urllib.request import urlopen
url = 'http://winterolympicsmedals.com/medals.csv'
output = urlopen(url).read()
print(output.decode('utf-8'))
Alternatively, you could use requests, which provides a more human-readable syntax. Keep in mind that requests is a third-party package that must be installed separately, which may increase the complexity of deploying the application, depending on your production environment.
import requests
url = 'http://winterolympicsmedals.com/medals.csv'
output = requests.get(url).text
print(output)
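Once the download is a str, the standard csv module can split it into rows. The sketch below assumes UTF-8 text. The helper name download_rows is hypothetical, and a tiny data: URL with made-up sample rows stands in for a real CSV URL so that the example runs offline.

```python
import csv
import io
from urllib.parse import quote
from urllib.request import urlopen


def download_rows(url, encoding="utf-8"):
    """Download a CSV file and return it as a list of rows (hypothetical helper)."""
    with urlopen(url) as response:
        text = response.read().decode(encoding)  # bytes -> str
    return list(csv.reader(io.StringIO(text, newline="")))


# Made-up sample data standing in for a real CSV URL.
sample_url = "data:text/csv;charset=utf-8," + quote("name,score\nalice,1\n")
rows = download_rows(sample_url)
print(rows)
```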

In Python 3.x, use the urllib package like this:
from urllib.request import urlopen
data = urlopen('http://www.google.com').read() #bytes
body = data.decode('utf-8')

Another good library for this is http://docs.python-requests.org
It's not built-in, but I've found it to be much more usable than urllib*.

Script to connect to a web page

Looking for a python script that would simply connect to a web page (maybe some querystring parameters).
I am going to run this script as a batch job in unix.

In Python 2, urllib2 will do what you want, and it's pretty simple to use.
import urllib
import urllib2
params = {'param1': 'value1'}
req = urllib2.Request("http://someurl", urllib.urlencode(params))
res = urllib2.urlopen(req)
data = res.read()
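Note that passing the encoded parameters as the second argument, as above, sends them as a POST body; for a GET request, append them to the URL as a query string instead. It's also nice because it's easy to modify the above code to do all sorts of other things, like Basic Authentication, etc.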
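For readers on Python 3, the same request could be written roughly as follows. This is a sketch rather than part of the original answer: urllib2 became urllib.request, urlencode moved to urllib.parse, and the encoded body has to be bytes. The URL is a placeholder, and the request is built but not sent.

```python
import urllib.parse
import urllib.request

params = {"param1": "value1"}
body = urllib.parse.urlencode(params).encode("ascii")  # bytes, as Python 3 requires

# Supplying data makes this a POST request.
req = urllib.request.Request("http://example.com/", data=body)
print(req.get_method())  # POST
print(req.data)          # b'param1=value1'

# To send it and read the reply:
# with urllib.request.urlopen(req) as res:
#     data = res.read()
```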

Try this (Python 2):
import urllib2
aResp = urllib2.urlopen("http://google.com/")
print aResp.read()

If you need your script to actually function as a user of the site (clicking links, etc.) then you're probably looking for the python mechanize library.
Python Mechanize

A simple wget called from a shell script might suffice.

In Python 2.7:
import urllib2
params = "key=val&key2=val2" # make sure that it's in GET request format
url = "http://www.example.com"
html = urllib2.urlopen(url+"?"+params).read()
print html
More info at https://docs.python.org/2.7/library/urllib2.html
In Python 3.6:
from urllib.request import urlopen
params = "key=val&key2=val2" # make sure that it's in GET request format
url = "http://www.example.com"
html = urlopen(url+"?"+params).read() # bytes in Python 3
print(html)
More info at https://docs.python.org/3.6/library/urllib.request.html
To encode params into GET format:
def myEncode(dictionary):
    result = ""
    for k in dictionary: # k is the key
        result += k+"="+dictionary[k]+"&"
    return result[:-1] # all but that last `&`
This should work in either Python 2 or Python 3, as long as every key and value is a string. Note that it does not percent-escape special characters such as spaces or `&`.
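In Python 3 the standard library already provides this encoding, including percent-escaping of special characters, through urllib.parse.urlencode. A short sketch with made-up parameters:

```python
from urllib.parse import urlencode

# Made-up parameters; the second value contains characters that need escaping.
params = {"key": "val", "q": "a b&c"}

query = urlencode(params)
print(query)  # key=val&q=a+b%26c

url = "http://www.example.com" + "?" + query
print(url)
```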

What are you trying to do? If you're just trying to fetch a web page, cURL is a pre-existing (and very common) tool that does exactly that.
Basic usage is very simple:
curl www.example.com

You might want to simply use httplib from the standard library (renamed http.client in Python 3). Pass only the host name, without the http:// scheme:
import httplib
myConnection = httplib.HTTPConnection('www.example.com')
You can find the official reference here: http://docs.python.org/library/httplib.html
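In Python 3 the module is named http.client. The sketch below only constructs connection objects, so no network traffic occurs. It shows that the constructor expects a bare host name: when a full URL is passed, the text after the last colon is read as a port number and the call raises InvalidURL.

```python
import http.client

# Constructing the object does not open a connection yet.
conn = http.client.HTTPConnection("www.example.com")
print(conn.host, conn.port)  # www.example.com 80

rejected = False
try:
    http.client.HTTPConnection("http://www.example.com")
except http.client.InvalidURL as exc:
    rejected = True
    print("rejected:", exc)

# A real request would look like:
# conn.request("GET", "/")
# body = conn.getresponse().read()
```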
