I am trying to search through http://www.wegottickets.com/ with the keywords "Live music". But the returned result is still the main page, not the search result page including lots of live music information. Could anyone show me out what the problem is?
from urllib2 import urlopen
from ClientForm import ParseResponse
response = urlopen("http://www.wegottickets.com/")
forms = ParseResponse(response, backwards_compat=False)
form = forms[0]
form.set_value("Live music", name="unified_query")
form.set_all_readonly(False)
control = form.find_control(type="submit")
print control.disabled
print control.readonly
#print form
request2 = form.click()
try:
response2 = urlopen(request2)
except:
print "Unsccessful query"
print response2.geturl()
print response2.info()
print response.read()
response2.close()
Thank you very much!
Never used it, but I've had success with the python mechanize module, if it turns out to be a fault in clientform.
However, as a first step, I'd suggest removing your try...except wrapper. What you're basically doing is saying "catch any error, then ignore the actual error and print 'Unsuccessful Query' instead". Not helpful for debugging. The exception will stop the program and print a useful error message, if you don't get in its way.
Related
I am trying to download books from "http://www.gutenberg.org/". I want to know why my code gets nothing.
import requests
import re
import os
import urllib
def get_response(url):
response = requests.get(url).text
return response
def get_content(html):
reg = re.compile(r'(<span class="mw-headline".*?</span></h2><ul><li>.*</a></li></ul>)',re.S)
return re.findall(reg,html)
def get_book_url(response):
reg = r'a href="(.*?)"'
return re.findall(reg,response)
def get_book_name(response):
reg = re.compile('>.*</a>')
return re.findall(reg,response)
def download_book(book_url,path):
path = ''.join(path.split())
path = 'F:\\books\\{}.html'.format(path) #my local file path
if not os.path.exists(path):
urllib.request.urlretrieve(book_url,path)
print('ok!!!')
else:
print('no!!!')
def get_url_name(start_url):
content = get_content(get_response(start_url))
for i in content:
book_url = get_book_url(i)
if book_url:
book_name = get_book_name(i)
try:
download_book(book_url[0],book_name[0])
except:
continue
def main():
get_url_name(start_url)
if __name__ == '__main__':
start_url = 'http://www.gutenberg.org/wiki/Category:Classics_Bookshelf'
main()
I have run the code and get nothing, no tracebacks. How can I download the books automatically from the website?
I have run the code and get nothing,no tracebacks.
Well, there's no chance you get a traceback in the case of an exception in download_book() since you explicitely silent them:
try:
download_book(book_url[0],book_name[0])
except:
continue
So the very first thing you want to do is to at least print out errors:
try:
download_book(book_url[0],book_name[0])
except exception as e:
print("while downloading book {} : got error {}".format(book_url[0], e)
continue
or just don't catch exception at all (at least until you know what to expect and how to handle it).
I don't even know how to fix it
Learning how to debug is actually even more important than learning how to write code. For a general introduction, you want to read this first.
For something more python-specific, here are a couple ways to trace your program execution:
1/ add print() calls at the important places to inspect what you really get
2/ import your module in the interactive python shell and test your functions in isolation (this is easier when none of them depend on global variables)
3/ use the builtin step debugger
Now there are a few obvious issues with your code:
1/ you don't test the result of request.get() - an HTTP request can fail for quite a few reasons, and the fact you get a response doesn't mean you got the expected response (you could have a 400+ or 500+ response as well.
2/ you use regexps to parse html. DONT - regexps cannot reliably work on html, you want a proper HTML parser instead (BeautifulSoup is the canonical solution for web scraping as it's very tolerant). Also some of your regexps look quite wrong (greedy match-all etc).
start_url is not defined in main()
You need to use a global variable. Otherwise, a better (cleaner) approach is to pass in the variable that you are using. In any case, I would expect an error, start_url is not defined
def main(start_url):
get_url_name(start_url)
if __name__ == '__main__':
start_url = 'http://www.gutenberg.org/wiki/Category:Classics_Bookshelf'
main(start_url)
EDIT:
Nevermind, the problem is in this line: content = get_content(get_response(start_url))
The regex in get_content() does not seem to match anything. My suggestion would be to use BeautifulSoup, from bs4 import BeautifulSoup. For any information regarding why you shouldn't parse html with regex, see this answer RegEx match open tags except XHTML self-contained tags
Asking regexes to parse arbitrary HTML is like asking a beginner to write an operating system
As others have said, you get no output because your regex doesn't match anything. The text returned by the initial url has got a newline between </h2> and <ul>, try this instead:
r'(<span class="mw-headline".*?</span></h2>\n<ul><li>.*</a></li></ul>)'
When you fix that one, you will face another error, I suggest some debug printouts like this:
def get_url_name(start_url):
content = get_content(get_response(start_url))
for i in content:
print('[DEBUG] Handling:', i)
book_url = get_book_url(i)
print('[DEBUG] book_url:', book_url)
if book_url:
book_name = get_book_name(i)
try:
print('[DEBUG] book_url[0]:', book_url[0])
print('[DEBUG] book_name[0]:', book_name[0])
download_book(book_url[0],book_name[0])
except:
continue
So I`m trying to send a request to a webpage and read its response. I did a code that compares the request and the page, and I cant get the same page text. Am I using "requests" correctly?
I really think that I misunderstand how requests function works and what it does. Can someone help me please?
import requests
import urllib
def search():
pr = {'q':'pink'}
r = requests.get('http://stackoverflow.com/search',params=pr)
returntext = r.text
urllibtest(returntext)
def urllibtest(returntext):
connection = urllib.urlopen("http://stackoverflow.com/search?q=pink")
output = connection.read()
connection.close()
if output == returntext:
print("ITS THE SAME PAGE")
else:
print("ITS NOT THE SAME PAGE")
search()
First of all, there is no good reason to expect two different stack overflow searches to return the exact same response anyway.
There is one logical difference here too, requests automatically decodes the output for you:
>>> type(output)
str
>>> type(r.text)
unicode
You can use the content instead if you don't want it decoded, and use a more predictable source to see the same content returned - for example:
>>> r1 = urllib.urlopen('http://httpbin.org').read()
>>> r2 = requests.get('http://httpbin.org').content
>>> r1 == r2
True
Sorry in advance for the beginner question. I'm just learning how to access web data in Python, and I'm having trouble understanding exception handling in the requests package.
So far, when accessing web data using the urllib package, I wrap the urlopen call in a try/except structure to catch bad URLs, like this:
import urllib, sys
url = 'https://httpbinTYPO.org/' # Note the typo in my URL
try: uh=urllib.urlopen(url)
except:
print 'Failed to open url.'
sys.exit()
text = uh.read()
print text
This is obviously kind of a crude way to do it, as it can mask all kinds of problems other than bad URLs.
From the documentation, I had sort of gathered that you could avoid the try/except structure when using the requests package, like this:
import requests, sys
url = 'https://httpbinTYPO.org/' # Note the typo in my URL
r = requests.get(url)
if r.raise_for_status() is not None:
print 'Failed to open url.'
sys.exit()
text = r.text
print text
However, this clearly doesn't work (throws an error and a traceback). What's the "right" (i.e., simple, elegant, Pythonic) way to do this?
Try to catch connection error:
from requests.exceptions import ConnectionError
try:
requests.get('https://httpbinTYPO.org/')
except ConnectionError:
print 'Failed to open url.'
You can specify a kind of exception after the keyword except. So to catch just errors that come from bad connections, you can do:
import urllib, sys
url = 'https://httpbinTYPO.org/' # Note the typo in my URL
try: uh=urllib.urlopen(url)
except IOError:
print 'Failed to open url.'
sys.exit()
text = uh.read()
print text
How to bypass missing link and continue to scrape good data?
I am using Python2 and Ubuntu 14.04.3.
I am scraping a web page with multiple links to associated data.
Some associated links are missing so I need a way to bypass the missing links and continue scraping.
Web page 1
part description 1 with associated link
part description 2 w/o associated link
more part descriptions with and w/o associcated links
Web page n+
more part descriptions
I tried:
try:
Do some things.
Error caused by missing link.
except Index Error as e:
print "I/O error({0}): {1}".format(e.errno, e.strerror)
break # to go on to next link.
# Did not work because program stopped to report error!
Since link is missing on web page can not use if missing link statement.
Thanks again for your help!!!
I corrected my faulty except error by following Python 2 documentation. Except correction jumped faulty web site missing link and continued on scraping data.
Except correction:
except:
# catch AttributeError: 'exceptions.IndexError' object has no attribute 'errno'
e = sys.exc_info()[0]
print "Error: %s" % e
break
I will look into the answer(s) posted to my questions.
Thanks again for your help!
Perhaps you are looking for something like this:
import urllib
def get_content_safe(url):
try:
contents = urllib.open(url)
return contents
except IOError, ex:
# Report ex your way
return None
def scrape:
# ....
content = get_content_safe(url)
if content == None:
pass # or continue or whatever
# ....
Long story short, just like Basilevs said, when you catch exception, your code will not break and will keep its execution.
I'm making auto-login script by use mechanize python.
Before I was used mechanize with no problem, but www.gmarket.co.kr in this site I couldn't make it .
whenever i try to login always login page was returned even with correct gmarket id , pass, i can't login and I saw some suspicious message
"<script language=javascript>top.location.reload();</script>"
I think this related with my problem, but don't know exactly how to handle .
Here is sample id and pass for login test
id: tgi177 pass: tk1047
if anyone can help me much appreciate thanks in advance
CODE:
# -*- coding: cp949 -*-
from lxml.html import parse, fromstring
import sys,os
import mechanize, urllib
import cookielib
import re
from BeautifulSoup import BeautifulSoup,BeautifulStoneSoup,Tag
try:
params = urllib.urlencode({'command':'login',
'url':'http%3A%2F%2Fwww.gmarket.co.kr%2F',
'member_type':'mem',
'member_yn':'Y',
'login_id':'tgi177',
'image1.x':'31',
'image1.y':'26',
'passwd':'tk1047',
'buyer_nm':'',
'buyer_tel_no1':'',
'buyer_tel_no2':'',
'buyer_tel_no3':''
})
rq = mechanize.Request("http://www.gmarket.co.kr/challenge/login.asp")
rs = mechanize.urlopen(rq)
data = rs.read()
logged_in = r'input_login_check_value' in data
if logged_in:
print ' login success !'
rq = mechanize.Request("http://www.gmarket.co.kr")
rs = mechanize.urlopen(rq)
data = rs.read()
print data
else:
print 'login failed!'
pass
quit()
except:
pass
mechanize doesn't have the ability to interact with JavaScript. Probably spidermonkey module will help you (I have no experience with it, but description is quite promising). Also you could handle such reload (e.g.Browser.reload() for this particular case) manually if it's the only site you have this problem.
Update:
Quick look through your page shows that you have submit to other URL (with https: scheme). Look through checkValid() JavaScript function. Posting to it gives other result. Note, that this looks like homework you should do yourself before asking.