I'd like to search text on a lot of websites at once. From what I understand (and I'm not sure I understand correctly), this code shouldn't work well:
from twisted.python.threadpool import ThreadPool
from txrequests import Session

pages = ['www.url1.com', 'www.url2.com']

with Session(pool=ThreadPool(maxthreads=10)) as sesh:
    for pagelink in pages:
        newresponse = sesh.get(pagelink)
        npages, text = text_from_html(newresponse.content)
My relevant functions are below (from this post); I don't think their exact contents (text extraction) are important, but I'll list them just in case.
from bs4 import BeautifulSoup, Comment

def tag_visible(element):
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    if isinstance(element, Comment):
        return False
    return True

def text_from_html(body):
    soup = BeautifulSoup(body, 'html.parser')
    texts = soup.findAll(text=True)
    visible_texts = filter(tag_visible, texts)
    return soup.find_all('a'), u" ".join(t.strip() for t in visible_texts)
If I were executing this sesh.get() without the extra functions, then, as I understand it: a request is sent, and it might take a while for the response to come back. In the time it takes for that response to arrive, other requests are sent, and some may be answered before earlier ones.
If I were only making requests within my for loop, this would happen as planned. But if I put functions within the for loop, am I stopping the requests from being asynchronous? Wouldn't the functions have to wait for the response before they can run? How can I execute something like this smoothly?
I'd also be open to suggestions of different modules; I have no particular attachment to this one. I think it does what I need it to, but I realize it's not necessarily the only module that can.
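To make the question concrete, here is roughly the shape of solution I imagine with a different module - a minimal sketch using the standard library's concurrent.futures together with plain requests (the URLs are placeholders, and text_from_html is my helper defined above); I don't know if this is the idiomatic way:

import concurrent.futures
import requests

pages = ['http://www.url1.com', 'http://www.url2.com']  # placeholder URLs

def fetch_and_parse(pagelink):
    # The blocking request only ties up this worker thread;
    # the other submitted pages keep downloading in parallel.
    response = requests.get(pagelink)
    # Parsing also happens in the worker, so a slow parse of one page
    # does not delay the downloads of the others.
    return text_from_html(response.content)

with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
    futures = [executor.submit(fetch_and_parse, link) for link in pages]
    for future in concurrent.futures.as_completed(futures):
        npages, text = future.result()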
Say I'm testing an RSS feed view in a Django app, is this how I should go about it?
def test_some_view(...):
    ...
    requested_url = reverse("personal_feed", args=[some_profile.auth_token])
    resp = client.get(requested_url, follow=True)
    ...
    assert dummy_object.title in str(resp.content)
Is reverse-ing and then passing that into the client.get() the right way to test? I thought it's DRYer and more future-proof than simply .get()ing the URL.
Should I assert that dummy_object is in the response this way?
I'm testing here using the str representation of the response object. When is it a good practice to do this vs. using selenium? I know it makes it easier to verify that said obj or property (like dummy_object.title) is encapsulated within an H1 tag for example. On the other hand, if I don't care about how the obj is represented, it's faster to do it like the above.
Reevaluating my comment (didn't carefully read the question and overlooked the RSS feed stuff):
Is reverse-ing and then passing that into the client.get() the right way to test? I thought it's DRYer and more future-proof than simply .get()ing the URL.
I would agree with that - from Django's point of view, you are testing your views and don't care about the exact endpoints they are mapped to. Using reverse is thus, IMO, the clear and correct approach.
Should I assert that dummy_object is in the response this way?
You have to pay attention here. response.content is a bytestring, so asserting dummy_object.title in str(resp.content) is dangerous. Consider the following example:
from django.contrib.syndication.views import Feed

class MyFeed(Feed):
    title = 'äöüß'
    ...
Registered the feed in urls:
urlpatterns = [
    path('my-feed/', MyFeed(), name='my-feed'),
]
Tests:
@pytest.mark.django_db
def test_feed_failing(client):
    uri = reverse('my-feed')
    resp = client.get(uri)
    assert 'äöüß' in str(resp.content)

@pytest.mark.django_db
def test_feed_passing(client):
    uri = reverse('my-feed')
    resp = client.get(uri)
    content = resp.content.decode(resp.charset)
    assert 'äöüß' in content
The first will fail; the second won't, because it handles the encoding correctly.
As for the check itself, I personally always prefer parsing the content into some meaningful data structure instead of working with the raw string, even for simple tests. For example, if you are checking for data in a text/html response, it's not much more overhead to write
soup = bs4.BeautifulSoup(content, 'html.parser')
assert soup.select_one('h1#title-headliner').text == 'title'
or
root = lxml.etree.parse(io.StringIO(content), lxml.etree.HTMLParser())
assert root.xpath('//h1[@id="title-headliner"]')[0].text == 'title'
than just
assert 'title' in content
In return, invoking a parser is more explicit (you won't accidentally match e.g. a title in the page metadata in head) and gives you an implicit data-integrity check (you know the payload is indeed valid HTML because it parsed successfully).
To your example: in case of RSS feed, I'd simply use the XML parser:
import io

from lxml import etree

@pytest.mark.django_db
def test_feed_title(client):
    uri = reverse('my-feed')
    resp = client.get(uri)
    root = etree.parse(io.BytesIO(resp.content))
    title = root.xpath('//channel/title')[0].text
    assert title == 'my title'
Here I'm using lxml, which is a faster implementation of the stdlib's xml. Another advantage of parsing the content into an XML tree is that the parser reads bytestrings and takes care of the encoding handling, so you don't have to decode anything yourself.
Or use something high-level like atoma, which has a nice API specifically for RSS entities, so you don't have to fight with XPath selectors:
import atoma

@pytest.mark.django_db
def test_feed_title(client):
    uri = reverse('my-feed')
    resp = client.get(uri)
    feed = atoma.parse_atom_bytes(resp.content)
    assert feed.title.value == 'my title'
...When is it a good practice to do this vs. using selenium?
Short answer: you don't need it. I hadn't paid much attention when reading your question and had HTML pages in mind when writing the comment. Regarding the selenium remark: selenium handles all the low-level stuff for you, so when the tests start to accumulate (and usually they do, pretty fast), writing
uri = reverse('news-feed')
resp = client.get(uri)
root = parser.parse(resp.content)
assert root.query('some-query')
and dragging the imports along becomes too much hassle, so selenium can replace it with
driver = WebDriver()
driver.get(uri)
assert driver.find_element_by_id('my-element').text == 'my value'
Sure, testing with an automated browser instance has other advantages, like seeing exactly what the user would see in a real browser, allowing the pages to execute client-side JavaScript, etc. But all of this applies mainly to testing HTML pages; for testing an RSS feed, selenium is overkill and Django's testing tools are more than enough.
I am trying to download books from "http://www.gutenberg.org/". I want to know why my code gets nothing.
import requests
import re
import os
import urllib

def get_response(url):
    response = requests.get(url).text
    return response

def get_content(html):
    reg = re.compile(r'(<span class="mw-headline".*?</span></h2><ul><li>.*</a></li></ul>)', re.S)
    return re.findall(reg, html)

def get_book_url(response):
    reg = r'a href="(.*?)"'
    return re.findall(reg, response)

def get_book_name(response):
    reg = re.compile('>.*</a>')
    return re.findall(reg, response)

def download_book(book_url, path):
    path = ''.join(path.split())
    path = 'F:\\books\\{}.html'.format(path)  # my local file path
    if not os.path.exists(path):
        urllib.request.urlretrieve(book_url, path)
        print('ok!!!')
    else:
        print('no!!!')

def get_url_name(start_url):
    content = get_content(get_response(start_url))
    for i in content:
        book_url = get_book_url(i)
        if book_url:
            book_name = get_book_name(i)
            try:
                download_book(book_url[0], book_name[0])
            except:
                continue

def main():
    get_url_name(start_url)

if __name__ == '__main__':
    start_url = 'http://www.gutenberg.org/wiki/Category:Classics_Bookshelf'
    main()
I have run the code and get nothing, no tracebacks. How can I download the books automatically from the website?
I have run the code and get nothing, no tracebacks.
Well, there's no chance you'll get a traceback in the case of an exception in download_book(), since you explicitly silence them:
try:
    download_book(book_url[0], book_name[0])
except:
    continue
So the very first thing you want to do is to at least print out errors:
try:
    download_book(book_url[0], book_name[0])
except Exception as e:
    print("while downloading book {}: got error {}".format(book_url[0], e))
    continue
or just don't catch exceptions at all (at least until you know what to expect and how to handle it).
I don't even know how to fix it
Learning how to debug is actually even more important than learning how to write code. For a general introduction, you want to read this first.
For something more python-specific, here are a couple ways to trace your program execution:
1/ add print() calls at the important places to inspect what you really get
2/ import your module in the interactive python shell and test your functions in isolation (this is easier when none of them depend on global variables)
3/ use the builtin step debugger, pdb (a minimal sketch follows below)
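For 3/, this is what the debugger looks like dropped into your own get_url_name() (purely illustrative):

import pdb

def get_url_name(start_url):
    content = get_content(get_response(start_url))
    # Execution pauses here: inspect `content` interactively,
    # step with `n`, continue with `c`, quit with `q`.
    pdb.set_trace()
    for i in content:
        ...

You can also run the whole script under the debugger with python -m pdb yourscript.py.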
Now there are a few obvious issues with your code:
1/ you don't test the result of requests.get() - an HTTP request can fail for quite a few reasons, and the fact that you get a response doesn't mean you got the expected response (you could have a 400+ or 500+ response as well).
2/ you use regexps to parse html. DON'T - regexps cannot reliably work on html; you want a proper HTML parser instead (BeautifulSoup is the canonical solution for web scraping as it's very tolerant). Also, some of your regexps look quite wrong (greedy match-all etc.); see the sketch below.
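To illustrate both points, here is a rough sketch of the first step done with a status check and BeautifulSoup instead of regexps - how you pick out the actual book links from all the anchors is only a guess here and needs adapting to the real markup of the bookshelf page:

import requests
from bs4 import BeautifulSoup

start_url = 'http://www.gutenberg.org/wiki/Category:Classics_Bookshelf'

resp = requests.get(start_url)
resp.raise_for_status()  # 1/ fail loudly on a 4xx/5xx instead of silently parsing an error page

soup = BeautifulSoup(resp.text, 'html.parser')  # 2/ a real HTML parser instead of regexps
for link in soup.find_all('a', href=True):
    # print every link with its text; filter these down to the actual book pages
    print(link['href'], link.get_text(strip=True))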
start_url is not defined in main()
You need to use a global variable. Otherwise, a better (cleaner) approach is to pass in the variable that you are using. In any case, I would expect an error: start_url is not defined.
def main(start_url):
    get_url_name(start_url)

if __name__ == '__main__':
    start_url = 'http://www.gutenberg.org/wiki/Category:Classics_Bookshelf'
    main(start_url)
EDIT:
Never mind, the problem is in this line: content = get_content(get_response(start_url))
The regex in get_content() does not seem to match anything. My suggestion would be to use BeautifulSoup (from bs4 import BeautifulSoup). For more on why you shouldn't parse HTML with regex, see this answer: RegEx match open tags except XHTML self-contained tags
Asking regexes to parse arbitrary HTML is like asking a beginner to write an operating system
As others have said, you get no output because your regex doesn't match anything. The text returned by the initial url has got a newline between </h2> and <ul>, try this instead:
r'(<span class="mw-headline".*?</span></h2>\n<ul><li>.*</a></li></ul>)'
When you fix that one, you will face another error, I suggest some debug printouts like this:
def get_url_name(start_url):
    content = get_content(get_response(start_url))
    for i in content:
        print('[DEBUG] Handling:', i)
        book_url = get_book_url(i)
        print('[DEBUG] book_url:', book_url)
        if book_url:
            book_name = get_book_name(i)
            try:
                print('[DEBUG] book_url[0]:', book_url[0])
                print('[DEBUG] book_name[0]:', book_name[0])
                download_book(book_url[0], book_name[0])
            except:
                continue
I'm trying to scrape a site. When I run the following code without region_id=[any number from 1 to 32], I get a [500]; but if I set region_id=1, I get only the first page by default (in the url it is pagina=&), and the pages go up to 500. Is there a command or parameter for retrieving every page (every possible value of pagina=), avoiding for loops?
import requests

url = "http://www.enciclovida.mx/explora-por-region/especies-por-grupo?utf8=%E2%9C%93&grupo_id=Plantas&region_id=&parent_id=&pagina=&nombre="
resp = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
data = resp.json()
Even without a for loop, you are still going to need iteration. You could do it with recursion or with map as I've done below, but the iteration is still there. This solution has the advantage that everything is a generator, so a url is only formatted, requested, checked and converted to json when you ask all_data for that page's json. I added a filter to make sure you get a valid response before trying to get the json out. It still makes every request sequentially, but you could replace map with a parallel implementation quite easily (see the sketch after the code).
import requests
from itertools import product, starmap
from functools import partial

def is_valid_resp(resp):
    return resp.status_code == requests.codes.ok

def get_json(resp):
    return resp.json()

# There's a .format hiding on the end of this really long url,
# with {} in appropriate places
url = "http://www.enciclovida.mx/explora-por-region/especies-por-grupo?utf8=%E2%9C%93&grupo_id=Plantas&region_id={}&parent_id=&pagina={}&nombre=".format

regions = range(1, 33)
pages = range(1, 501)

urls = starmap(url, product(regions, pages))
moz_get = partial(requests.get, headers={'User-Agent': 'Mozilla/5.0'})
responses = map(moz_get, urls)
valid_responses = filter(is_valid_resp, responses)
all_data = map(get_json, valid_responses)

# all_data is a generator that will give you each page's json.
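For instance, one possible parallel drop-in for the map(moz_get, urls) step is a thread pool from the standard library (a sketch, assuming the rest of the snippet above stays the same):

from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor(max_workers=10) as pool:
    # Unlike the lazy built-in map, pool.map submits all the requests up front
    # and runs up to 10 of them at a time.
    responses = pool.map(moz_get, urls)
    valid_responses = filter(is_valid_resp, responses)
    # consume the results inside the with-block, while the pool is still alive
    all_data = [get_json(resp) for resp in valid_responses]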
I'm currently working on an arcbot and I'm trying to make a command !urbandictionary. It should scrape the meaning of a term - the first one provided by urbandictionary. If there's another solution, e.g. another dictionary site with a better API, that's also fine. Here's my code:
if Command.lower() == '!urban':
    dictionary = Argument[1]  # this is the term which the user provides, e.g. "scrape"
    dictionaryscrape = urllib2.urlopen('http://www.urbandictionary.com/define.php?term=' + dictionary).read()  # plain html of the site
    scraped = getBetweenHTML(dictionaryscrape, '<div class="meaning">', '</div>')  # Here's my problem: I'm not sure if it scrapes the first meaning or not
    messages.main(scraped, xSock, BotID)  # Sends the meaning of the provided word (Argument[0])
How do I correctly scrape a meaning of a word in urbandictionary?
Just get the text from the meaning class:
import requests
from bs4 import BeautifulSoup

word = "scrape"
r = requests.get("http://www.urbandictionary.com/define.php?term={}".format(word))
soup = BeautifulSoup(r.content, "html.parser")
print(soup.find("div", attrs={"class": "meaning"}).text)
Gassing and breaking your car repeatedly really fast so that the front and rear bumpers "scrape" the pavement; while going hyphy
There is an unofficial api here apparently
`http://api.urbandictionary.com/v0/define?term={word}`
From https://github.com/zdict/zdict/wiki/Urban-dictionary-API-documentation
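A small sketch of querying that endpoint with requests - I'm assuming the JSON payload contains a "list" of definition entries, which is what the endpoint appears to return:

import requests

word = "scrape"
resp = requests.get("http://api.urbandictionary.com/v0/define", params={"term": word})
resp.raise_for_status()
data = resp.json()

# assumption: the payload has a "list" of entries, each with a "definition" field
if data.get("list"):
    print(data["list"][0]["definition"])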
So I have a data retrieval/entry project and I want to extract a certain part of a webpage and store it in a text file. I have a text file of urls and the program is supposed to extract the same part of the page for each url.
Specifically, the program copies the legal statute following "Legal Authority:" on pages such as this. As you can see, there is only one statute listed. However, some of the urls also look like this, meaning that there are multiple separated statutes.
My code works for pages of the first kind:
from sys import argv
from urllib2 import urlopen

script, urlfile, legalfile = argv

input = open(urlfile, "r")
output = open(legalfile, "w")

def get_legal(page):
    # this is where Legal Authority: starts in the code
    start_link = page.find('Legal Authority:')
    start_legal = page.find('">', start_link+1)
    end_link = page.find('<', start_legal+1)
    legal = page[start_legal+2: end_link]
    return legal

for line in input:
    pg = urlopen(line).read()
    statute = get_legal(pg)
    output.write(get_legal(pg))
This gives me the desired statute name in the "legalfile" output .txt. However, it cannot copy multiple statute names. I've tried something like this:
def get_legal(page):
    # this is where Legal Authority: starts in the code
    end_link = ""
    legal = ""
    start_link = page.find('Legal Authority:')
    while (end_link != '</a> '):
        start_legal = page.find('">', start_link+1)
        end_link = page.find('<', start_legal+1)
        end2 = page.find('</a> ', end_link+1)
        legal += page[start_legal+2: end_link]
        if
            break
    return legal
Since every list of statutes ends with '</a> ' (inspect the source of either of the two links) I thought I could use that fact (having it as the end of the index) to loop through and collect all the statutes in one string. Any ideas?
I would suggest using BeautifulSoup to parse and search your html. This will be much easier than doing basic string searches.
Here's a sample that pulls all the <a> tags found within the <td> tag that contains the <b>Legal Authority:</b> tag. (Note that I'm using the requests library to fetch the page content here - it's just a recommended and very easy-to-use alternative to urlopen.)
import requests
from BeautifulSoup import BeautifulSoup

# fetch the content of the page with requests library
url = "http://www.reginfo.gov/public/do/eAgendaViewRule?pubId=200210&RIN=1205-AB16"
response = requests.get(url)

# parse the html
html = BeautifulSoup(response.content)

# find all the <a> tags
a_tags = html.findAll('a', attrs={'class': 'pageSubNavTxt'})

def fetch_parent_tag(tags):
    # fetch the parent <td> tag of the first <a> tag
    # whose "previous sibling" is the <b>Legal Authority:</b> tag.
    for tag in tags:
        sibling = tag.findPreviousSibling()
        if not sibling:
            continue
        if sibling.getText() == 'Legal Authority:':
            return tag.findParent()

# now, just find all the child <a> tags of the parent.
# i.e. finding the parent of one child, find all the children
parent_tag = fetch_parent_tag(a_tags)
tags_you_want = parent_tag.findAll('a')

for tag in tags_you_want:
    print 'statute: ' + tag.getText()
If this isn't exactly what you needed to do, BeautifulSoup is still the tool you likely want to use for sifting through html.
They provide XML data over there, see my comment. If you think you can't download that many files (or the other end could dislike so many HTTP GET requests), I'd recommend asking their admins if they would kindly provide you with a different way of accessing the data.
I have done so twice in the past (with scientific databases). In one instance the sheer size of the dataset prohibited a download; they ran an SQL query of mine and e-mailed the results (but had previously offered to mail a DVD or hard disk). In another case, I could have made some million HTTP requests to a webservice (and they were OK with that), each fetching about 1k bytes. This would have taken a long time and would have been quite inconvenient (requiring some error handling, since some of those requests would always time out), and non-atomic due to paging. I was mailed a DVD instead.
I'd imagine that the Office of Management and Budget could be similarly accommodating.