Say I'm testing an RSS feed view in a Django app, is this how I should go about it?
def test_some_view(...):
    ...
    requested_url = reverse("personal_feed", args=[some_profile.auth_token])
    resp = client.get(requested_url, follow=True)
    ...
    assert dummy_object.title in str(resp.content)
Is reverse-ing and then passing that into the client.get() the right way to test? I thought it's DRYer and more future-proof than simply .get()ing the URL.
Should I assert that dummy_object is in the response this way?
I'm testing here using the str representation of the response object. When is it a good practice to do this vs. using selenium? I know it makes it easier to verify that said obj or property (like dummy_object.title) is encapsulated within an H1 tag for example. On the other hand, if I don't care about how the obj is represented, it's faster to do it like the above.
Reevaluating my comment (didn't carefully read the question and overlooked the RSS feed stuff):
Is reverse-ing and then passing that into the client.get() the right way to test? I thought it's DRYer and more future-proof than simply .get()ing the URL.
I would agree with that: from Django's point of view, you are testing your views and don't care which exact endpoints they are mapped to. Using reverse is thus, IMO, the clear and correct approach.
Should I assert that dummy_object is in the response this way?
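You have to pay attention here. response.content is a bytestring, and calling str() on bytes returns its repr (b'...' with non-ASCII bytes shown as escape sequences), not the decoded text, so asserting dummy_object.title in str(resp.content) is dangerous. Consider the following example: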
from django.contrib.syndication.views import Feed
class MyFeed(Feed):
    title = 'äöüß'
    ...
Registered the feed in urls:
urlpatterns = [
    path('my-feed/', MyFeed(), name='my-feed'),
]
Tests:
@pytest.mark.django_db
def test_feed_failing(client):
    uri = reverse('my-feed')
    resp = client.get(uri)
    # str() of bytes yields the repr, so the non-ASCII title is never found
    assert 'äöüß' in str(resp.content)
@pytest.mark.django_db
def test_feed_passing(client):
    uri = reverse('my-feed')
    resp = client.get(uri)
    # decode with the charset the response declares
    content = resp.content.decode(resp.charset)
    assert 'äöüß' in content
The first test fails and the second passes, because only the second decodes the bytes with the response's charset.
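To make the failure mode concrete, here is a minimal sketch that needs no Django at all. The `content` bytes are a made-up stand-in for `resp.content`, and UTF-8 is assumed as the charset:

```python
# Stand-in for resp.content: a UTF-8 encoded body (assumption for illustration).
title = 'äöüß'
content = '<title>{}</title>'.format(title).encode('utf-8')

# str() on bytes gives the repr, e.g. "b'<title>\\xc3\\xa4...'", not the text.
as_repr = str(content)
# Decoding with the right charset gives back the original characters.
as_text = content.decode('utf-8')

found_in_repr = title in as_repr
found_in_text = title in as_text
print(found_in_repr, found_in_text)
```

With a pure-ASCII title both checks would succeed, which is why this kind of bug tends to go unnoticed until real data shows up.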
As for the check itself, personally I always prefer parsing the content into some meaningful data structure instead of working with a raw string, even for simple tests. For example, if you are checking for data in a text/html response, it's not much more overhead to write
soup = bs4.BeautifulSoup(content, 'html.parser')
assert soup.select_one('h1#title-headliner').text == 'title'
or
root = lxml.etree.parse(io.StringIO(content), lxml.etree.HTMLParser())
assert root.xpath('//h1[@id="title-headliner"]')[0].text == 'title'
than just
assert 'title' in content
However, invoking a parser is more explicit (you won't accidentally test for e.g. the title in the page metadata in head) and also makes an implicit check for data integrity (e.g. you know that the payload is indeed parseable HTML because it parsed successfully).
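As a rough illustration of the "more explicit" point, the sketch below uses the stdlib's html.parser in place of bs4 or lxml, purely so it runs without extra installs. The page markup and the H1Collector class are hypothetical:

```python
from html.parser import HTMLParser


class H1Collector(HTMLParser):
    """Collect the text content of every <h1> element."""

    def __init__(self):
        super().__init__()
        self._in_h1 = False
        self.h1_texts = []

    def handle_starttag(self, tag, attrs):
        if tag == 'h1':
            self._in_h1 = True
            self.h1_texts.append('')

    def handle_endtag(self, tag):
        if tag == 'h1':
            self._in_h1 = False

    def handle_data(self, data):
        if self._in_h1:
            self.h1_texts[-1] += data


# The word "title" appears in <head>, but the headline says something else.
page = (
    '<html><head><title>title</title></head>'
    '<body><h1 id="title-headliner">other</h1></body></html>'
)

collector = H1Collector()
collector.feed(page)

substring_hit = 'title' in page   # passes for the wrong reason
h1_texts = collector.h1_texts     # what the headline really contains
print(substring_hit, h1_texts)
```

The substring check succeeds even though the headline is wrong, while the parsed check looks only at the element the test cares about.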
To your example: in case of RSS feed, I'd simply use the XML parser:
import io
from lxml import etree
def test_feed_title(client):
    uri = reverse('my-feed')
    resp = client.get(uri)
    # parse straight from bytes; the parser handles the encoding
    root = etree.parse(io.BytesIO(resp.content))
    title = root.xpath('//channel/title')[0].text
    assert title == 'my title'
Here, I'm using lxml, which is a faster implementation of the ElementTree API from the stdlib's xml package. Another advantage of parsing the content into an XML tree is that the parser reads from bytestrings and takes care of the encoding handling, so you don't have to decode anything yourself.
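The same "let the parser handle the encoding" idea can be sketched with the stdlib's xml.etree.ElementTree instead of lxml (swapped in here only to avoid a third-party dependency). The feed below is a made-up ISO-8859-1 document, not output from a real Django feed:

```python
import xml.etree.ElementTree as ET

title = 'äöüß'
feed_text = (
    '<?xml version="1.0" encoding="iso-8859-1"?>'
    '<rss version="2.0"><channel><title>' + title + '</title></channel></rss>'
)
# Bytes as they would arrive over the wire, in the declared encoding.
payload = feed_text.encode('iso-8859-1')

# The parser reads the encoding from the XML declaration on its own.
root = ET.fromstring(payload)
parsed_title = root.find('channel/title').text

# Guessing the wrong charset by hand mangles the text instead.
wrong_guess = payload.decode('utf-8', errors='replace')
print(parsed_title, title in wrong_guess)
```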
Or use something high-level like atoma, which has a nice API specifically for RSS entities, so you don't have to fight with XPath selectors:
import atoma
@pytest.mark.django_db
def test_feed_title(client):
    uri = reverse('my-feed')
    resp = client.get(uri)
    # the feed is RSS, so use the RSS parser; the channel title is a plain string
    feed = atoma.parse_rss_bytes(resp.content)
    assert feed.title == 'my title'
...When is it a good practice to do this vs. using selenium?
Short answer: you don't need it. I hadn't paid much attention when reading your question and had HTML pages in mind when writing the comment. Regarding the selenium remark: the library handles all the low-level stuff, so when the tests start to accumulate (and usually they do, pretty fast), writing
uri = reverse('news-feed')
resp = client.get(uri)
root = parser.parse(resp.content)
assert root.query('some-query')
and dragging the imports along becomes too much hassle, so selenium can replace it with
driver = WebDriver()
driver.get(uri)
assert driver.find_element_by_id('my-element').text == 'my value'
Sure, testing with an automated browser instance has other advantages, like seeing exactly what the user would see in a real browser, allowing the pages to execute client-side JavaScript etc. But of course, all of this applies mainly to testing HTML pages; for testing an RSS feed, selenium is overkill and Django's testing tools are more than enough.
Related
I want to scrape sample_info.csv file from https://depmap.org/portal/download/.
Since the site is rendered by a React script, it is not straightforward to locate the file through a tag with BeautifulSoup. I approached this from many angles; the one that gave the best results is shown below. It returns the executed script, in which all downloadable files are listed together with other data. My idea was then to strip the tags and store the information as JSON. However, I think there must be some kind of mistake in the data, because it cannot be parsed as JSON.
url = 'https://depmap.org/portal/download/'
html_content = requests.get(url).text
soup = BeautifulSoup(html_content, "lxml")
all_scripts = soup.find_all('script')
script = str(all_scripts[32])
last_char_index = script.rfind("}]")
first_char_index = script.find("[{")
script_cleaned = script[first_char_index:last_char_index+2]
script_json = json.loads(script_cleaned)
This code gives me an error
JSONDecodeError: Extra data: line 1 column 7250 (char 7249)
I know my solution might not be elegant, but it got me closest to the goal, i.e. downloading the sample_info.csv file from the website. I'm not sure how to proceed from here. Are there other options? I tried selenium, but that will not be feasible for the end user of my script because of the driver path declaration.
It is probably easier in this context to use regular expressions, since the string is invalid JSON.
This RegEx tool (https://pythex.org/) can be useful for testing expressions.
import re
re.findall(r'"downloadUrl": "(.*?)".*?"fileName": "(.*?)"', script_cleaned)
#[
# ('https://ndownloader.figshare.com/files/26261524', 'CCLE_gene_cn.csv'),
# ('https://ndownloader.figshare.com/files/26261527', 'CCLE_mutations.csv'),
# ('https://ndownloader.figshare.com/files/26261293', 'Achilles_gene_effect.csv'),
# ('https://ndownloader.figshare.com/files/26261569', 'sample_info.csv'),
# ('https://ndownloader.figshare.com/files/26261476', 'CCLE_expression.csv'),
# ('https://ndownloader.figshare.com/files/17741420', 'primary_replicate_collapsed_logfold_change_v2.csv'),
# ('https://gygi.med.harvard.edu/publications/ccle', 'protein_quant_current_normalized.csv'),
# ('https://ndownloader.figshare.com/files/13515395', 'D2_combined_gene_dep_scores.csv')
# ]
Edit: This also works when passing html_content directly (no need for BeautifulSoup).
url = 'https://depmap.org/portal/download/'
html_content = requests.get(url).text
re.findall(r'"downloadUrl": "(.*?)".*?"fileName": "(.*?)"', html_content)
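To get from the list of pairs to the one file of interest, the matches can be turned into a lookup table. The sketch below runs on a small made-up string in place of the real page (the example.com URLs are placeholders), so it works offline:

```python
import re

# Made-up excerpt shaped like the data embedded in the page's script.
sample = (
    '{"downloadUrl": "https://example.com/files/1", "size": 1, '
    '"fileName": "CCLE_gene_cn.csv"}, '
    '{"downloadUrl": "https://example.com/files/2", '
    '"fileName": "sample_info.csv"}'
)

# Same pattern as above: lazily capture each URL and the file name after it.
pairs = re.findall(r'"downloadUrl": "(.*?)".*?"fileName": "(.*?)"', sample)

# Map file name -> download URL for direct lookup.
url_by_name = {name: url for url, name in pairs}
print(url_by_name['sample_info.csv'])
```

Once the URL is known, something like requests.get(url).content could fetch the file itself. Note that the pattern assumes each "downloadUrl" is followed by its own "fileName" on the same line; if the page layout changes, the pairs may no longer line up.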
I'd like to search text on a lot of websites at once. From what I understand (and I'm not sure I understand correctly), this code shouldn't work well.
from twisted.python.threadpool import ThreadPool
from txrequests import Session
pages = ['www.url1.com', 'www.url2.com']
with Session(pool=ThreadPool(maxthreads=10)) as sesh:
    for pagelink in pages:
        newresponse = sesh.get(pagelink)
        npages, text = text_from_html(newresponse.content)
My relevant functions are below (from this post). I don't think their exact contents (text extraction) are important, but I'll list them just in case.
def tag_visible(element):
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    if isinstance(element, Comment):
        return False
    return True
def text_from_html(body):
    soup = BeautifulSoup(body, 'html.parser')
    texts = soup.findAll(text=True)
    visible_texts = filter(tag_visible, texts)
    return soup.find_all('a'), u" ".join(t.strip() for t in visible_texts)
If I was executing this sesh.get() without the extra functions below, from what I understand: a request is sent, and perhaps it might take a while to come back with a response. In the time that it takes for this response to come, other requests are sent; and perhaps responded to before some prior requests.
If I was only making requests within my for loop, this would happen as planned. But if I put functions within the for loop, am I stopping the requests from being asynchronous? Wouldn't the functions wait for the response in order to be executed? How can I smoothly execute something like this?
I'd also be open to suggestions of different modules, I have no particular attachment to this one - I think it does what I need it to, but I realize it's not necessarily the only module that can.
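One possible alternative, sketched here with the stdlib's concurrent.futures rather than txrequests, is to put the fetch and the processing into a single task per URL, so that processing one page never blocks the requests for the others. The fetch() function is a hypothetical stand-in for the real HTTP call (it only sleeps), and process() stands in for text_from_html:

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch(url):
    """Hypothetical stand-in for an HTTP request; sleeps instead of going online."""
    time.sleep(0.01)
    return '<html><body>{}</body></html>'.format(url)


def process(body):
    """Stand-in for text_from_html(); here it just measures the body."""
    return len(body)


def fetch_and_process(url):
    # Each worker thread fetches one page and processes it right away.
    return url, process(fetch(url))


pages = ['www.url1.com', 'www.url2.com']

with ThreadPoolExecutor(max_workers=10) as pool:
    futures = [pool.submit(fetch_and_process, url) for url in pages]
    # as_completed yields results in finish order, not submission order.
    results = dict(future.result() for future in as_completed(futures))

print(results)
```

In this arrangement, waiting happens inside the worker threads, so the loop that submits the work is never stalled by a slow response.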
I am trying to download books from "http://www.gutenberg.org/". I want to know why my code gets nothing.
import requests
import re
import os
import urllib
def get_response(url):
    response = requests.get(url).text
    return response
def get_content(html):
    reg = re.compile(r'(<span class="mw-headline".*?</span></h2><ul><li>.*</a></li></ul>)',re.S)
    return re.findall(reg,html)
def get_book_url(response):
    reg = r'a href="(.*?)"'
    return re.findall(reg,response)
def get_book_name(response):
    reg = re.compile('>.*</a>')
    return re.findall(reg,response)
def download_book(book_url,path):
    path = ''.join(path.split())
    path = 'F:\\books\\{}.html'.format(path) #my local file path
    if not os.path.exists(path):
        urllib.request.urlretrieve(book_url,path)
        print('ok!!!')
    else:
        print('no!!!')
def get_url_name(start_url):
    content = get_content(get_response(start_url))
    for i in content:
        book_url = get_book_url(i)
        if book_url:
            book_name = get_book_name(i)
            try:
                download_book(book_url[0],book_name[0])
            except:
                continue
def main():
    get_url_name(start_url)
if __name__ == '__main__':
    start_url = 'http://www.gutenberg.org/wiki/Category:Classics_Bookshelf'
    main()
I have run the code and get nothing, no tracebacks. How can I download the books automatically from the website?
I have run the code and get nothing,no tracebacks.
Well, there's no chance you get a traceback in the case of an exception in download_book(), since you explicitly silence them:
try:
    download_book(book_url[0],book_name[0])
except:
    continue
So the very first thing you want to do is to at least print out errors:
try:
    download_book(book_url[0], book_name[0])
except Exception as e:
    print("while downloading book {} : got error {}".format(book_url[0], e))
    continue
or just don't catch exception at all (at least until you know what to expect and how to handle it).
I don't even know how to fix it
Learning how to debug is actually even more important than learning how to write code. For a general introduction, you want to read this first.
For something more python-specific, here are a couple ways to trace your program execution:
1/ add print() calls at the important places to inspect what you really get
2/ import your module in the interactive python shell and test your functions in isolation (this is easier when none of them depend on global variables)
3/ use the builtin step debugger
Now there are a few obvious issues with your code:
1/ you don't test the result of requests.get() - an HTTP request can fail for quite a few reasons, and the fact you get a response doesn't mean you got the expected response (you could have a 400+ or 500+ response as well).
2/ you use regexps to parse HTML. DON'T - regexps cannot reliably work on HTML, you want a proper HTML parser instead (BeautifulSoup is the canonical solution for web scraping as it's very tolerant). Also some of your regexps look quite wrong (greedy match-all etc).
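As a rough sketch of the second point, here is what pulling links out with a parser instead of regexps can look like. It uses the stdlib's html.parser in place of BeautifulSoup only so that it runs without extra installs; the markup and the LinkCollector class are made up for illustration:

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs for every <a> element that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self._href = dict(attrs).get('href')
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._href is not None:
            self.links.append((self._href, ''.join(self._text).strip()))
            self._href = None


# Made-up fragment shaped like a bookshelf page, newline included.
html = (
    '<h2><span class="mw-headline">Authors</span></h2>\n'
    '<ul><li><a href="/ebooks/1">First Book</a></li>\n'
    '<li><a href="/ebooks/2">Second Book</a></li></ul>'
)

collector = LinkCollector()
collector.feed(html)
links = collector.links
print(links)
```

Whitespace and newlines between tags make no difference to the parser. For the first point, requests offers response.raise_for_status(), which raises on 4xx and 5xx responses.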
start_url is not defined in main()
You need to use a global variable. Otherwise, a better (cleaner) approach is to pass in the variable that you are using. In any case, I would expect an error, start_url is not defined
def main(start_url):
    get_url_name(start_url)
if __name__ == '__main__':
    start_url = 'http://www.gutenberg.org/wiki/Category:Classics_Bookshelf'
    main(start_url)
EDIT:
Never mind, the problem is in this line: content = get_content(get_response(start_url))
The regex in get_content() does not seem to match anything. My suggestion would be to use BeautifulSoup (from bs4 import BeautifulSoup). For why you shouldn't parse HTML with regex, see this answer: RegEx match open tags except XHTML self-contained tags
Asking regexes to parse arbitrary HTML is like asking a beginner to write an operating system
As others have said, you get no output because your regex doesn't match anything. The text returned by the initial URL has a newline between </h2> and <ul>; try this instead:
r'(<span class="mw-headline".*?</span></h2>\n<ul><li>.*</a></li></ul>)'
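The effect of that single newline is easy to check in isolation. The fragment below is a made-up piece of markup with the same shape; the third pattern, which uses \s* to tolerate any whitespace, is an extra suggestion rather than part of the fix above:

```python
import re

# Made-up fragment with a newline between </h2> and <ul>, as on the real page.
html = (
    '<h2><span class="mw-headline" id="x">Authors</span></h2>\n'
    '<ul><li><a href="/ebooks/1">Book</a></li></ul>'
)

original = re.compile(
    r'(<span class="mw-headline".*?</span></h2><ul><li>.*</a></li></ul>)', re.S)
fixed = re.compile(
    r'(<span class="mw-headline".*?</span></h2>\n<ul><li>.*</a></li></ul>)', re.S)
# Variant that accepts any amount of whitespace, including none.
tolerant = re.compile(
    r'(<span class="mw-headline".*?</span></h2>\s*<ul><li>.*</a></li></ul>)', re.S)

original_matches = original.findall(html)
fixed_matches = fixed.findall(html)
tolerant_matches = tolerant.findall(html)
print(len(original_matches), len(fixed_matches), len(tolerant_matches))
```

re.S lets . match newlines, but the literal </h2><ul> in the original pattern still requires the two tags to be directly adjacent, which is why it finds nothing.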
When you fix that one, you will face another error, I suggest some debug printouts like this:
def get_url_name(start_url):
    content = get_content(get_response(start_url))
    for i in content:
        print('[DEBUG] Handling:', i)
        book_url = get_book_url(i)
        print('[DEBUG] book_url:', book_url)
        if book_url:
            book_name = get_book_name(i)
            try:
                print('[DEBUG] book_url[0]:', book_url[0])
                print('[DEBUG] book_name[0]:', book_name[0])
                download_book(book_url[0],book_name[0])
            except:
                continue
I usually write function-only Python programs, but have decided on an OOD approach (my first) for my current program, a web scraper:
import csv
import urllib2
NO_VACANCIES = ['no vacancies', 'not hiring']
class Page(object):
    def __init__(self, url):
        self.url = url
    def get_source(self):
        self.source = urllib2.urlopen(url).read()
        return self.source
class HubPage(Page):
    def has_vacancies(self):
        return not(any(text for text in NO_VACANCIES if text in self.source.lower()))
urls = []
with open('25.csv', 'rb') as spreadsheet:
    reader = csv.reader(spreadsheet)
    for row in reader:
        urls.append(row[0].strip())
for url in urls:
    page = HubPage(url)
    source = page.get_source()
    if page.has_vacancies():
        print 'Has vacancies'
Some context: HubPage represents a typical 'jobs' page on a company's web site. I am subclassing Page because I will eventually subclass it again for individual job pages, with some methods that will be used only to extract data from those pages (this may be overkill).
Here's my issue: I know from experience that urllib2, while it has its critics, is fast - very fast - at doing what it does, namely fetch a page's source. Yet I notice that in my design, processing of each url is taking a few orders of magnitude longer than what I typically observe.
Is it the fact that class instantiations are involved (unnecessarily, perhaps)?
Might the fact that HubPage is inherited be at cause?
Is the call to any() known to be expensive when it contains a list comprehension such as it does here?
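One way to narrow this down is to time the steps separately. The sketch below (Python 3, whereas the code above is Python 2) times only the text check on a large made-up page, with has_vacancies written as a plain function that lowercases the source once. If this step turns out to be quick, the time is more likely going into the network fetch than into classes, inheritance or any():

```python
import time

NO_VACANCIES = ['no vacancies', 'not hiring']


def has_vacancies(source):
    """Return True unless one of the 'no vacancies' phrases appears."""
    lowered = source.lower()  # lowercase once, not once per phrase
    return not any(phrase in lowered for phrase in NO_VACANCIES)


# Made-up page of roughly one million characters.
page_source = '<html>' + 'x' * 1000000 + ' We are NOT HIRING right now.</html>'

start = time.perf_counter()
result = has_vacancies(page_source)
elapsed = time.perf_counter() - start

print('check took {:.4f}s, has vacancies: {}'.format(elapsed, result))
```

Wrapping the urlopen call in the same perf_counter pattern would show how the fetch compares.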
I'm interested in writing a short python script which uploads a short binary file (.wav/.raw audio) via a POST request to a remote server.
I've done this with pycurl, which makes it very simple and results in a concise script; unfortunately it also requires that the end user have pycurl installed, which I can't rely on.
I've also seen some examples in other posts which rely only on basic libraries, urllib, urllib2, etc., however these generally seem to be quite verbose, which is also something I'd like to avoid.
I'm wondering if there are any concise examples which do not require the use of external libraries, and which will be quick and easy for 3rd parties to understand - even if they aren't particularly familiar with python.
What I'm using at present looks like,
def upload_wav( wavfile, url=None, **kwargs ):
    """Upload a wav file to the server, return the response."""
    class responseCallback:
        """Store the server response."""
        def __init__(self):
            self.contents=''
        def body_callback(self, buf):
            self.contents = self.contents + buf
        def decode( self ):
            self.contents = urllib.unquote(self.contents)
            try:
                self.contents = simplejson.loads(self.contents)
            except:
                return self.contents
    t = responseCallback()
    c = pycurl.Curl()
    c.setopt(c.POST,1)
    c.setopt(c.WRITEFUNCTION, t.body_callback)
    c.setopt(c.URL,url)
    postdict = [
        ('userfile',(c.FORM_FILE,wavfile)), #wav file to post
    ]
    #If there are extra keyword args add them to the postdict
    for key in kwargs:
        postdict.append( (key,kwargs[key]) )
    c.setopt(c.HTTPPOST,postdict)
    c.setopt(c.VERBOSE,verbose)
    c.perform()
    c.close()
    t.decode()
    return t.contents
This isn't exact, but it gives you the general idea. It works great and it's simple for 3rd parties to understand, but it requires pycurl.
POSTing a file requires multipart/form-data encoding and, as far as I know, there's no easy way (i.e. one-liner or something) to do this with the stdlib. But as you mentioned, there are plenty of recipes out there.
Although they seem verbose, your use case suggests that you can probably just encapsulate it once into a function or class and not worry too much, right? Take a look at the recipe on ActiveState and read the comments for suggestions:
Recipe 146306: Http client to POST using multipart/form-data
or see the MultiPartForm class in this PyMOTW, which seems pretty reusable:
PyMOTW: urllib2 - Library for opening URLs.
I believe both handle binary files.
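For a sense of what those recipes boil down to, here is a minimal Python 3 sketch of multipart/form-data encoding wrapped in one function. The name encode_multipart, the field names and the sample bytes are all made up for illustration; it is not the code from either recipe and skips details such as escaping quotes in file names:

```python
import uuid


def encode_multipart(fields, files):
    """Build a multipart/form-data body.

    fields: dict of name -> text value
    files:  dict of name -> (filename, bytes)
    Returns (content_type, body_bytes).
    """
    boundary = uuid.uuid4().hex
    delimiter = b'--' + boundary.encode('ascii')
    lines = []
    for name, value in fields.items():
        lines += [
            delimiter,
            'Content-Disposition: form-data; name="{}"'.format(name).encode('utf-8'),
            b'',
            str(value).encode('utf-8'),
        ]
    for name, (filename, data) in files.items():
        lines += [
            delimiter,
            'Content-Disposition: form-data; name="{}"; filename="{}"'.format(
                name, filename).encode('utf-8'),
            b'Content-Type: application/octet-stream',
            b'',
            data,  # raw bytes go in untouched
        ]
    lines += [delimiter + b'--', b'']
    body = b'\r\n'.join(lines)
    return 'multipart/form-data; boundary=' + boundary, body


content_type, body = encode_multipart(
    {'user': 'demo'},
    {'userfile': ('clip.wav', b'RIFF\x00\x01')},
)
print(content_type, len(body))
```

The returned pair can then be handed to urllib.request.Request(url, data=body, headers={'Content-Type': content_type}), which keeps the calling code to a couple of lines.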
I ran into a similar issue today. After trying both pycurl and multipart/form-data, I decided to read the Python httplib/urllib2 source code, and I found a comparatively good solution:
set the Content-Length header (the size of the file) before doing the POST
pass an opened file when doing the POST
Here is the code:
import urllib2, os
image_path = "png\\01.png"
url = 'http://xx.oo.com/webserviceapi/postfile/'
length = os.path.getsize(image_path)
png_data = open(image_path, "rb")
request = urllib2.Request(url, data=png_data)
request.add_header('Cache-Control', 'no-cache')
request.add_header('Content-Length', '%d' % length)
request.add_header('Content-Type', 'image/png')
res = urllib2.urlopen(request).read().strip()
print(res)
see my blog post: http://www.2maomao.com/blog/python-http-post-a-binary-file-using-urllib2/
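For readers on Python 3, where urllib2 became urllib.request, the same raw-body idea might look roughly like the sketch below. The endpoint URL and the payload bytes are placeholders, and the request is only built, not sent, so the sketch runs offline:

```python
import urllib.request

# Stand-in for the bytes read from a real file with open(path, 'rb').read().
payload = b'\x89PNG'

request = urllib.request.Request(
    'http://example.com/upload',  # hypothetical endpoint
    data=payload,
    headers={
        'Cache-Control': 'no-cache',
        'Content-Type': 'image/png',
        'Content-Length': str(len(payload)),
    },
    method='POST',
)

# urllib.request.urlopen(request).read() would send it and return the reply.
print(request.get_method(), request.get_header('Content-type'))
```

Note that this sends the file as the entire request body, so it only suits servers that expect a raw upload rather than a form field.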
I know this is an old old stack, but I have a different solution.
If you went thru the trouble of building all the magic headers and everything, and are just UPSET that suddenly a binary file can't pass because python library is mean.. you can monkey patch a solution..
import httplib
class HTTPSConnection(httplib.HTTPSConnection):
    def _send_output(self, message_body=None):
        self._buffer.extend(("",""))
        msg = "\r\n".join(self._buffer)
        del self._buffer[:]
        self.send(msg)
        if message_body is not None:
            self.send(message_body)
httplib.HTTPSConnection = HTTPSConnection
If you are using HTTP:// instead of HTTPS:// then replace all instances of HTTPSConnection above with HTTPConnection.
Before people get upset with me, YES, this is a BAD SOLUTION, but it is a way to fix existing code you really don't want to re-engineer to do it some other way.
Why does this fix it? Go look at the original Python source, httplib.py file.
How's urllib substantially more verbose? You build postdict basically the same way, except you start with
postdict = [ ('userfile', open(wavfile, 'rb').read()) ]
Once you have postdict,
resp = urllib.urlopen(url, urllib.urlencode(postdict))
and then you get and save resp.read() and maybe unquote and try JSON-loading if needed. Seems like it would be actually shorter! So what am I missing...?
urllib.urlencode doesn't like some kinds of binary data.
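A quick way to see what urlencode does to binary data, assuming Python 3 (the thread itself uses Python 2, where urllib.urlencode may behave differently); the bytes below are made up:

```python
from urllib.parse import urlencode

# Made-up binary content, e.g. the first bytes of an audio file.
raw = b'\x00\xffRIFF'

# Every byte outside the safe set is percent-encoded as three characters.
encoded = urlencode({'userfile': raw})
print(encoded)
```

Besides inflating the payload, the result is an application/x-www-form-urlencoded string rather than multipart/form-data, so a server that expects a file upload field may not treat it as a file at all.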