I am trying to download books from "http://www.gutenberg.org/". I want to know why my code gets nothing.
import requests
import re
import os
import urllib
def get_response(url):
    response = requests.get(url).text
    return response
def get_content(html):
    reg = re.compile(r'(<span class="mw-headline".*?</span></h2><ul><li>.*</a></li></ul>)',re.S)
    return re.findall(reg,html)
def get_book_url(response):
    reg = r'a href="(.*?)"'
    return re.findall(reg,response)
def get_book_name(response):
    reg = re.compile('>.*</a>')
    return re.findall(reg,response)
def download_book(book_url,path):
    path = ''.join(path.split())
    path = 'F:\\books\\{}.html'.format(path) #my local file path
    if not os.path.exists(path):
        urllib.request.urlretrieve(book_url,path)
        print('ok!!!')
    else:
        print('no!!!')
def get_url_name(start_url):
    content = get_content(get_response(start_url))
    for i in content:
        book_url = get_book_url(i)
        if book_url:
            book_name = get_book_name(i)
            try:
                download_book(book_url[0],book_name[0])
            except:
                continue
def main():
    get_url_name(start_url)
if __name__ == '__main__':
    start_url = 'http://www.gutenberg.org/wiki/Category:Classics_Bookshelf'
    main()
I have run the code and get nothing, no tracebacks. How can I download the books automatically from the website?
I have run the code and get nothing,no tracebacks.
Well, there's no chance you'll get a traceback if an exception is raised in download_book(), since you explicitly silence them:
try:
    download_book(book_url[0],book_name[0])
except:
    continue
So the very first thing you want to do is to at least print out errors:
try:
    download_book(book_url[0],book_name[0])
except Exception as e:
    print("while downloading book {} : got error {}".format(book_url[0], e))
    continue
or just don't catch exception at all (at least until you know what to expect and how to handle it).
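A minimal sketch of a loop that reports errors instead of swallowing them. The names `download_all` and `fake_download` are hypothetical and exist only for illustration; the real download_book() would be passed in place of the stand-in.

```python
import traceback

def download_all(books, download):
    """Try every (url, name) pair; return the failures instead of hiding them."""
    failures = []
    for url, name in books:
        try:
            download(url, name)
        except Exception as exc:  # still broad, but no longer silent
            print("while downloading {}: got error {!r}".format(url, exc))
            traceback.print_exc()  # full traceback goes to stderr
            failures.append((url, exc))
    return failures

def fake_download(url, name):
    # Hypothetical stand-in for download_book(); rejects non-absolute URLs
    if not url.startswith("http"):
        raise ValueError("not an absolute URL: " + url)

failures = download_all(
    [("http://example.com/a", "A"), ("/wiki/b", "B")],
    fake_download,
)
```

Collecting the failures in a list keeps the loop going while still leaving a record of what went wrong and why.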
I don't even know how to fix it
Learning how to debug is actually even more important than learning how to write code. For a general introduction, you want to read this first.
For something more python-specific, here are a couple ways to trace your program execution:
1/ add print() calls at the important places to inspect what you really get
2/ import your module in the interactive python shell and test your functions in isolation (this is easier when none of them depend on global variables)
3/ use the builtin step debugger
Now there are a few obvious issues with your code:
1/ you don't test the result of requests.get() - an HTTP request can fail for quite a few reasons, and the fact that you get a response doesn't mean you got the expected response (you could have a 4xx or 5xx response as well).
2/ you use regexps to parse html. DONT - regexps cannot reliably work on html, you want a proper HTML parser instead (BeautifulSoup is the canonical solution for web scraping as it's very tolerant). Also some of your regexps look quite wrong (greedy match-all etc).
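To illustrate point 2/, here is a rough sketch of extracting links with a real HTML parser. It uses the standard library's html.parser rather than BeautifulSoup so that it runs without extra packages; the `LinkCollector` class, the `extract_links` helper and the sample markup are all made up for the example. For point 1/, calling `raise_for_status()` on the object returned by requests.get() is one way to turn a 4xx or 5xx response into an exception.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an <a href=...> element
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

def extract_links(html):
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links

# Made-up fragment shaped roughly like a bookshelf page
sample = ('<h2><span class="mw-headline">Homer</span></h2>\n'
          '<ul><li><a href="/ebooks/6130">The Iliad</a></li></ul>')
links = extract_links(sample)
```

The newline between the heading and the list makes no difference to the parser, which is the kind of detail that tends to break a hand-written regex.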
start_url is not defined in main()
You need to use a global variable. Otherwise, a better (cleaner) approach is to pass in the variable that you are using. In any case, I would expect an error, start_url is not defined
def main(start_url):
    get_url_name(start_url)
if __name__ == '__main__':
    start_url = 'http://www.gutenberg.org/wiki/Category:Classics_Bookshelf'
    main(start_url)
EDIT:
Nevermind, the problem is in this line: content = get_content(get_response(start_url))
The regex in get_content() does not seem to match anything. My suggestion would be to use BeautifulSoup (from bs4 import BeautifulSoup). For why you shouldn't parse HTML with regex, see this answer: "RegEx match open tags except XHTML self-contained tags".
Asking regexes to parse arbitrary HTML is like asking a beginner to write an operating system
As others have said, you get no output because your regex doesn't match anything. The text returned by the initial url has got a newline between </h2> and <ul>, try this instead:
r'(<span class="mw-headline".*?</span></h2>\n<ul><li>.*</a></li></ul>)'
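The effect of that newline can be reproduced offline. The sketch below runs the original pattern and a slightly more tolerant variant (`\s*` instead of a hard-coded `\n`, which is my own substitution) against a made-up fragment shaped like the page source.

```python
import re

# Made-up fragment: note the newline between </h2> and <ul>
html = ('<span class="mw-headline" id="Homer">Homer</span></h2>\n'
        '<ul><li><a href="/ebooks/6130">The Iliad</a></li></ul>')

strict = re.compile(
    r'(<span class="mw-headline".*?</span></h2><ul><li>.*</a></li></ul>)', re.S)
tolerant = re.compile(
    r'(<span class="mw-headline".*?</span></h2>\s*<ul><li>.*</a></li></ul>)', re.S)

strict_matches = strict.findall(html)      # original pattern: finds nothing
tolerant_matches = tolerant.findall(html)  # allows whitespace before <ul>
```

An empty list from findall() is exactly why the for loop in get_url_name() never runs and the script prints nothing.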
When you fix that one, you will face another error, I suggest some debug printouts like this:
def get_url_name(start_url):
    content = get_content(get_response(start_url))
    for i in content:
        print('[DEBUG] Handling:', i)
        book_url = get_book_url(i)
        print('[DEBUG] book_url:', book_url)
        if book_url:
            book_name = get_book_name(i)
            try:
                print('[DEBUG] book_url[0]:', book_url[0])
                print('[DEBUG] book_name[0]:', book_name[0])
                download_book(book_url[0],book_name[0])
            except:
                continue
I am trying to pull the number of followers from a list of Instagram accounts. I have tried using the "find" method on the text returned by Requests; however, the string that I am looking for when I inspect the actual Instagram page no longer appears when I print "r" from the code below.
I was able to get this code to run successfully in the past; however, it no longer works.
Webscraping Instagram follower count BeautifulSoup
import requests
user = "espn"
url = 'https://www.instagram.com/' + user
r = requests.get(url).text
start = '"edge_followed_by":{"count":'
end = '},"followed_by_viewer"'
print(r[r.find(start)+len(start):r.rfind(end)])
I receive a "-1" error, which means the substring from the find method was not found within the variable "r".
I think it's because of the last ' in start and first ' in end...this will work:
import requests
import re
user = "espn"
url = 'https://www.instagram.com/' + user
r = requests.get(url).text
followers = re.search('"edge_followed_by":{"count":([0-9]+)}',r).group(1)
print(followers)
'14061730'
I want to suggest an updated solution to this question, as the answer of Derek Eden above from 2019 does not work anymore, as stated in its comments.
The solution was to add the r' before the regular expression in the re.search like so:
follower_count = re.search(r'"edge_followed_by\\":{\\"count\\":([0-9]+)}', response).group(1)
This r'' is really important: without it, Python processes the backslash escapes in the string before the regex engine sees them, so the pattern no longer matches the backslashes in the page and the search gives no results.
Also, the Instagram page seems to have backslashes in the object we are looking for, at least in my tests, so the code example I use is the following, in Python 3.10 and working as of July 2022:
# get follower count of instagram profile
import os.path
import requests
import re
import urllib3
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
# get instagram follower count
def get_instagram_follower_count(instagram_username):
    url = "https://www.instagram.com/" + instagram_username
    filename = "instagram.html"
    try:
        if not os.path.isfile(filename):
            r = requests.get(url, verify=False)
            print(r.status_code)
            print(r.text)
            response = r.text
            if not r.status_code == 200:
                raise Exception("Error: " + str(r.status_code))
            with open(filename, "w") as f:
                f.write(response)
        else:
            with open(filename, "r") as f:
                response = f.read()
            # print(response)
        follower_count = re.search(r'"edge_followed_by\\":{\\"count\\":([0-9]+)}', response).group(1)
        return follower_count
    except Exception as e:
        print(e)
        return 0
print(get_instagram_follower_count('your.instagram.profile'))
The method returns the follower count as expected. Please note that I added a few lines that save the response in a file, so that I don't hammer Instagram's web server and get blocked while testing.
This is a slice of the original html content that contains the part we are looking for:
... mRL&s=1\",\"edge_followed_by\":{\"count\":110070},\"fbid\":\"1784 ...
I debugged the regex in regexr, it seems to work just fine at this point in time.
There are many posts about the regex r prefix like this one
Also the documentation of the re package shows clearly that this is the issue with the code above.
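The difference the r prefix makes can be checked without contacting Instagram at all. This sketch runs both spellings of the pattern against the slice of HTML quoted above; the variable names are made up for the example.

```python
import re

# Slice of the page source quoted above, with literal backslashes before the quotes
sample = r'mRL&s=1\",\"edge_followed_by\":{\"count\":110070},\"fbid\":\"1784'

# Raw string: the regex engine receives \\ and matches a literal backslash
raw_pattern = r'"edge_followed_by\\":{\\"count\\":([0-9]+)}'
# Plain string: Python collapses \\ to \, so the regex sees \" (just a quote)
plain_pattern = '"edge_followed_by\\":{\\"count\\":([0-9]+)}'

raw_match = re.search(raw_pattern, sample)
plain_match = re.search(plain_pattern, sample)

followers = int(raw_match.group(1)) if raw_match else None
```

Checking the match object before calling group(1) also avoids the AttributeError that re.search(...).group(1) raises when nothing matches.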
Say I'm testing an RSS feed view in a Django app, is this how I should go about it?
def test_some_view(...):
    ...
    requested_url = reverse("personal_feed", args=[some_profile.auth_token])
    resp = client.get(requested_url, follow=True)
    ...
    assert dummy_object.title in str(resp.content)
Is reverse-ing and then passing that into the client.get() the right way to test? I thought it's DRYer and more future-proof than simply .get()ing the URL.
Should I assert that dummy_object is in the response this way?
I'm testing here using the str representation of the response object. When is it a good practice to do this vs. using selenium? I know it makes it easier to verify that said obj or property (like dummy_object.title) is encapsulated within an H1 tag for example. On the other hand, if I don't care about how the obj is represented, it's faster to do it like the above.
Reevaluating my comment (didn't carefully read the question and overlooked the RSS feed stuff):
Is reverse-ing and then passing that into the client.get() the right way to test? I thought it's DRYer and more future-proof than simply .get()ing the URL.
I would agree with that - from Django's point of view, you are testing your views and don't care which exact endpoints they are mapped to. Using reverse is thus IMO the clear and correct approach.
Should I assert that dummy_object is in the response this way?
You have to pay attention here. response.content is a bytestring, so asserting dummy_object.title in str(resp.content) is dangerous. Consider the following example:
from django.contrib.syndication.views import Feed
class MyFeed(Feed):
    title = 'äöüß'
    ...
Registered the feed in urls:
urlpatterns = [
    path('my-feed/', MyFeed(), name='my-feed'),
]
Tests:
@pytest.mark.django_db
def test_feed_failing(client):
    uri = reverse('my-feed')
    resp = client.get(uri)
    assert 'äöüß' in str(resp.content)
@pytest.mark.django_db
def test_feed_passing(client):
    uri = reverse('my-feed')
    resp = client.get(uri)
    content = resp.content.decode(resp.charset)
    assert 'äöüß' in content
One will fail, the other won't because of the correct encoding handling.
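The reason is easy to see outside Django. In this small sketch, `content` stands in for resp.content, under the assumption that the response is UTF-8 encoded:

```python
title = 'äöüß'
content = title.encode('utf-8')     # stand-in for resp.content: a bytes object

as_repr = str(content)              # the repr of the bytes, starting with b'
as_text = content.decode('utf-8')   # the actual text

found_in_repr = title in as_repr    # non-ASCII bytes show up as \x.. escapes
found_in_text = title in as_text
```

For plain ASCII titles str(resp.content) happens to work, which is why the problem often goes unnoticed until the first non-ASCII character shows up.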
As for the check itself, personally I always prefer parsing the content to some meaningful data structure instead of working with raw string even for simple tests. For example, if you are checking for data in a text/html response, it's not much more overhead in writing
soup = bs4.BeautifulSoup(content, 'html.parser')
assert soup.select_one('h1#title-headliner').text == 'title'
or
root = lxml.etree.parse(io.StringIO(content), lxml.etree.HTMLParser())
assert root.xpath('//h1[@id="title-headliner"]')[0].text == 'title'
than just
assert 'title' in content
However, invoking a parser is more explicit (you won't accidentally test for e.g. the title in page metadata in head) and also makes an implicit check for data integrity (e.g. you know that the payload is indeed valid HTML because it parsed successfully).
To your example: in case of RSS feed, I'd simply use the XML parser:
import io
from lxml import etree
def test_feed_title(client):
    uri = reverse('my-feed')
    resp = client.get(uri)
    root = etree.parse(io.BytesIO(resp.content))
    title = root.xpath('//channel/title')[0].text
    assert title == 'my title'
Here, I'm using lxml, which is a faster alternative to the stdlib's xml package. The advantage of parsing the content into an XML tree is also that the parser reads from bytestrings and takes care of the encoding handling - so you don't have to decode anything yourself.
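If installing lxml is not an option, roughly the same check can be sketched with the standard library's xml.etree.ElementTree. The payload below is a made-up RSS document standing in for resp.content, not the output of a real Django feed.

```python
import xml.etree.ElementTree as ET

# Made-up RSS payload standing in for resp.content (bytes, UTF-8 encoded)
payload = (
    '<?xml version="1.0" encoding="utf-8"?>\n'
    '<rss version="2.0"><channel><title>äöüß</title>'
    '<item><title>first entry</title></item></channel></rss>'
).encode('utf-8')

root = ET.fromstring(payload)                # the parser decodes the bytes itself
feed_title = root.findtext('channel/title')  # path is relative to the <rss> root
item_titles = [el.text for el in root.iterfind('channel/item/title')]
```

ElementTree only supports a subset of XPath, so more elaborate queries may still call for lxml.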
Or use something high-level like atoma, which has a nice API specifically for RSS entities, so you don't have to fight with XPath selectors:
import atoma
@pytest.mark.django_db
def test_feed_title(client):
    uri = reverse('my-feed')
    resp = client.get(uri)
    feed = atoma.parse_atom_bytes(resp.content)
    assert feed.title.value == 'my title'
...When is it a good practice to do this vs. using selenium?
Short answer - you don't need it. I hadn't paid much attention when reading your question and had HTML pages in mind when writing the comment. Regarding the selenium remark - this library handles all the low-level stuff, so when the tests start to accumulate (and they usually do, pretty fast), writing
uri = reverse('news-feed')
resp = client.get(uri)
root = parser.parse(resp.content)
assert root.query('some-query')
and dragging the imports along becomes too much hassle, so selenium can replace it with
driver = WebDriver()
driver.get(uri)
assert driver.find_element_by_id('my-element').text == 'my value'
Sure, testing with an automated browser instance has other advantages like seeing exactly what the user would see in real browser, allowing the pages to execute client-side javascript etc. But of course, all of this applies mainly to HTML pages testing; in case of testing against the RSS feed selenium usage is an overkill and Django's testing tools are more than enough.
My code is to search a Link passed in the command prompt, get the HTML code for the webpage at the Link, search the HTML code for links on the webpage, and then repeat these steps for the links found. I hope that is clear.
It should print out any links that cause errors.
Some more needed info:
The max visits it can do is 100.
If a website has an error, a None value is returned.
Python3 is what I am using
eg:
s = readwebpage(url)... # This line of code gets the HTML code for the link(url) passed in its argument.... if the link has an error, s = None.
The HTML code for that website has links that end in p2.html, p3.html, p4.html, and p5.html on its webpage. My code reads all of these, but it does not visit these links individually to search for more links. If it did this, it should search through these links and find a link that ends in p10.html, and then it should report that the link ending with p10.html has errors. Obviously it doesn't do that at the moment, and it's giving me a hard time.
My code..
url = args.url[0]
url_list = [url]
checkedURLs = []
AmountVisited = 0
while (url_list and AmountVisited<maxhits):
    url = url_list.pop()
    s = readwebpage(url)
    print("testing url: http",url) #Print the url being tested, this code is here only for testing..
    AmountVisited = AmountVisited + 1
    if s == None:
        print("* bad reference to http", url)
    else:
        urls_list = re.findall(r'href="http([\s:]?[^\'" >]+)', s) #Creates a list of all links in HTML code starting with...
        while urls_list: #... http or https
            insert = urls_list.pop()
            while(insert in checkedURLs and urls_list):
                insert = urls_list.pop()
            url_list.append(insert)
            checkedURLs = insert
Please help :)
Here is the code you wanted. However, please, stop using regexes for parsing HTML. BeautifulSoup is the way to go for that.
import re
from urllib.request import urlopen
def readwebpage(url):
    # Return the page's HTML as text, or None if the request fails
    print("testing", url)
    try:
        return urlopen(url).read().decode('utf-8', errors='replace')
    except (OSError, ValueError):
        return None
url = 'http://xrisk.esy.es' #put starting url here
yet_to_visit = [url]
visited_urls = []
AmountVisited = 0
maxhits = 10
while (yet_to_visit and AmountVisited<maxhits):
    print(yet_to_visit)
    current = yet_to_visit.pop()
    AmountVisited = AmountVisited + 1
    html = readwebpage(current)
    if html is None:
        print("* bad reference to", current)
    else:
        r = re.compile('(?<=href=").*?(?=")')
        links = re.findall(r,html) #Creates a list of all href values in the HTML code
        for u in links:
            if u in visited_urls:
                continue
            elif u.find('http')!=-1:
                yet_to_visit.append(u)
        print(links)
    visited_urls.append(current)
Not Python but since you mentioned you aren't tied strictly to regex, I think you might find some use in using wget for this.
wget --spider -o C:\wget.log -e robots=off -w 1 -r -l 10 http://www.stackoverflow.com
Broken down:
--spider: When invoked with this option, Wget will behave as a Web spider, which means that it will not download the pages, just check that they are there.
-o C:\wget.log: Log all messages to C:\wget.log.
-e robots=off: Ignore robots.txt
-w 1: set a wait time of 1 second
-r: set recursive search on
-l 10: sets the recursive depth to 10, meaning wget will only go as deep as 10 levels in, this may need to change depending on your max requests
http://www.stackoverflow.com: the URL you want to start with
Once complete, you can review the wget.log entries to determine which links had errors by searching for something like HTTP status codes 404, etc.
I suspect your regex is part of your problem. Right now, you have http outside your capture group, and [\s:] matches "some sort of whitespace (ie \s) or :"
I'd change the regex to: urls_list = re.findall(r'href="(.*)"',s). Also known as "match anything in quotes, after href=". If you absolutely need to ensure the http[s]://, use r'href="(https?://.*)"' (s? => one or zero s)
EDIT: And with an actually working regex, using a non-greedy match: r'href=(?P<q>[\'"])(https?://.*?)(?P=q)'
(Also, uh, while it's not technically necessary in your case because re caches, I think it's good practice to get into the habit of using re.compile.)
I think it's awfully nice that all of your URLs are full URLs. Do you have to deal with relative URLs at all?
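If relative URLs do turn up, urllib.parse.urljoin can resolve them against the page they were found on. The sketch below is one possible shape for the crawl loop, not the asker's code: the `crawl` function, its `fetch` parameter (playing the role of readwebpage) and the in-memory `pages` dictionary are all assumptions made so the example runs without network access.

```python
import re
from collections import deque
from urllib.parse import urljoin

# Same quoting idea as the regex above, but accepting relative links too
HREF_RE = re.compile(r'href=(?P<q>[\'"])(.*?)(?P=q)')

def crawl(start_url, fetch, max_hits=100):
    """Breadth-first crawl; fetch(url) returns HTML text, or None on error.

    Returns (visited, bad), where bad lists the URLs for which fetch gave None.
    """
    queue = deque([start_url])
    seen = {start_url}
    visited, bad = [], []
    while queue and len(visited) < max_hits:
        url = queue.popleft()
        visited.append(url)
        html = fetch(url)
        if html is None:
            bad.append(url)
            continue
        for match in HREF_RE.finditer(html):
            link = urljoin(url, match.group(2))  # resolves relative links
            if link.startswith('http') and link not in seen:
                seen.add(link)
                queue.append(link)
    return visited, bad

# Hypothetical in-memory "site"; p10.html is missing, so fetching it gives None
pages = {
    'http://site.test/p1.html': '<a href="p2.html">2</a> <a href="http://site.test/p3.html">3</a>',
    'http://site.test/p2.html': "<a href='p10.html'>10</a> <a href=\"p1.html\">back</a>",
    'http://site.test/p3.html': '',
}
visited, bad = crawl('http://site.test/p1.html', pages.get)
```

Marking a URL as seen when it is queued, rather than when it is visited, is what keeps the same link from being queued twice.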
I am working on an open-source project called RubberBand that lets you do what the title says: locally execute a Python file that is located on a web server. However, I have run into a problem. If a comma is located in a string (e.g. "http:"), it will return an error.
'''
RubberBand Version 1.0.1 'Indigo-Charlie'
http://www.lukeshiels.com/rubberband
CHANGE-LOG:
Changed Error Messages.
Changed Whole Code Into one function, rather than three.
Changed Importing required libraries into one line instead of two
'''
#Edit Below this line
import httplib, urlparse
def executeFromURL(url):
    if (url == None):
        print "!# RUBBERBAND_ERROR: No URL Specified #!"
    else:
        CORE = None
        good_codes = [httplib.OK, httplib.FOUND, httplib.MOVED_PERMANENTLY]
        host, path = urlparse.urlparse(url)[1:3]
        try:
            conn = httplib.HTTPConnection(host)
            conn.request('HEAD', path)
            CORE = conn.getresponse().status
        except StandardError:
            CORE = None
        if(CORE in good_codes):
            exec(url)
        else:
            print "!# RUBBERBAND_ERROR: File Does Not Exist On WEBSERVER #!"
RubberBand in three lines without error checking:
import requests
def execute_from_url(url):
    exec(requests.get(url).content)
You should use a return statement in your if (url == None): block as there is no point in carrying on with your function.
Whereabouts in your code is the error? Is there a full traceback? URIs with commas parse fine with the urlparse module.
Is it perhaps httplib.ResponseNotReady when calling CORE = conn.getresponse().status?
Nevermind that error message, that was me quickly testing your code and re-using the same connection object. I can't see what would be erroneous in your code.
I would suggest checking this question.
Avoid commas in the URL; that is my suggestion.
Can I use commas in a URL?
This seems to work well for me:
import urllib
(fn,hd) = urllib.urlretrieve('http://host.com/file.py')
execfile(fn)
I prefer to use standard libraries, because I'm using python bundled with third party software (abaqus) which makes it a real headache to add packages.
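For Python 3, where execfile() and urllib.urlretrieve() no longer exist, a standard-library-only sketch might look like the following. The helper names are made up for the example. Note that the code in the question calls exec(url), which runs the URL string itself rather than the downloaded file, and that is a likely source of a syntax error at "http:".

```python
from urllib.request import urlopen

def execute_source(source, name="<remote>"):
    """Compile and run Python source text; return the namespace it populated."""
    namespace = {"__name__": "__remote__"}
    exec(compile(source, name, "exec"), namespace)
    return namespace

def execute_from_url(url):
    # Download the file body first; the URL string itself is not Python code
    with urlopen(url) as response:
        source = response.read().decode("utf-8")
    return execute_source(source, name=url)

# Offline check with an inline snippet instead of a real URL
ns = execute_source("x = 6 * 7\n")
```

Passing the URL as the filename to compile() makes tracebacks from the remote code point at where it came from. As with any exec, this should only be pointed at code you trust.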
I am trying to search through http://www.wegottickets.com/ with the keywords "Live music". But the returned result is still the main page, not the search result page with lots of live music information. Could anyone show me what the problem is?
from urllib2 import urlopen
from ClientForm import ParseResponse
response = urlopen("http://www.wegottickets.com/")
forms = ParseResponse(response, backwards_compat=False)
form = forms[0]
form.set_value("Live music", name="unified_query")
form.set_all_readonly(False)
control = form.find_control(type="submit")
print control.disabled
print control.readonly
#print form
request2 = form.click()
try:
    response2 = urlopen(request2)
except:
    print "Unsccessful query"
print response2.geturl()
print response2.info()
print response.read()
response2.close()
Thank you very much!
Never used it, but I've had success with the python mechanize module, if it turns out to be a fault in clientform.
However, as a first step, I'd suggest removing your try...except wrapper. What you're basically doing is saying "catch any error, then ignore the actual error and print 'Unsuccessful Query' instead". Not helpful for debugging. The exception will stop the program and print a useful error message, if you don't get in its way.