I'm building an automation test for finding any possible dead links in a WP plugin. To this end, I have two helper functions.
The first spins up a Selenium webdriver:
@pytest.fixture(scope='session')
def setup():
    d = webdriver.Firefox()
    d.implicitly_wait(5)  # applies to the element lookups below
    site = os.getenv('TestSite')
    d.get(site)
    user, pw = os.getenv('TestUser'), os.getenv('TestPw')
    d.find_element(By.ID, 'user_login').send_keys(user)
    d.find_element(By.ID, 'user_pass').send_keys(pw)
    d.find_element(By.ID, 'wp-submit').click()
    yield d, site
    d.quit()
The second reads in a JSON file, picking out each page in the file then yielding it to the third function:
def page_generator() -> Iterator[Dict[str, Any]]:
    json_path = '../linkList.json'
    try:
        with open(json_path) as doc:
            body = json.load(doc)
    except FileNotFoundError:
        print(f'Did not find a file at {json_path}')
        exit(1)
    for page in body['pages']:
        yield page
The third function is the meat and potatoes of the test, running through the page and looking for each link. My goal is to parametrize each page, with my function header presently looking like...
@pytest.mark.parametrize('spin_up, page', [(setup, p) for p in page_generator()])
def test_links(spin_up, page):
    driver, site = spin_up
    # Do all the things
Unfortunately, running this results in TypeError: cannot unpack non-iterable function object. Is the only option to stick yield d, site inside some sort of loop to turn the function into an iterable, or is there a way to tell test_links to iteratively pull the same setup function as its spin_up value?
There is an easy way to do what you want, and that's to specify the setup fixture directly as an argument to test_links. Note that pytest is smart enough to figure out that the setup argument refers to a fixture while the page argument refers to a parametrization:
@pytest.mark.parametrize('page', page_generator())
def test_links(setup, page):
    driver, site = setup
You might also take a look at parametrize_from_file. This is a package I wrote to help with loading test cases from JSON/YAML/TOML files. It basically does the same thing as your page_generator() function, but more succinctly (and more generally).
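Going from memory of the package's defaults (so treat the file name and layout as assumptions, and check its docs for the exact API), usage would look roughly like this:
# test_links.py -- rough sketch only
import parametrize_from_file

@parametrize_from_file  # by default loads parameters from a test_links.yml/.json/.toml next to this file
def test_links(setup, page):
    driver, site = setup
    # Do all the things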
I've created a script in Python to scrape the user_name from a site's landing page and the title from its inner page. I'm trying to use the concurrent.futures library to perform parallel tasks. I know how to use executor.submit() within the script below, so I'm not interested in going that way. I would like to use executor.map(), which I've already tried (perhaps in the wrong way) within the following script.
I've tried with:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import concurrent.futures as futures

URL = "https://stackoverflow.com/questions/tagged/web-scraping"
base = "https://stackoverflow.com"

def get_links(s, url):
    res = s.get(url)
    soup = BeautifulSoup(res.text, "lxml")
    for item in soup.select(".summary"):
        user_name = item.select_one(".user-details > a").get_text(strip=True)
        post_link = urljoin(base, item.select_one(".question-hyperlink").get("href"))
        yield s, user_name, post_link

def fetch(s, name, url):
    res = s.get(url)
    soup = BeautifulSoup(res.text, "lxml")
    title = soup.select_one("h1[itemprop='name'] > a").text
    return name, title

if __name__ == '__main__':
    with requests.Session() as s:
        with futures.ThreadPoolExecutor(max_workers=5) as executor:
            link_list = [url for url in get_links(s, URL)]
            for result in executor.map(fetch, *link_list):
                print(result)
I get the following error when I run the above script as is:
TypeError: fetch() takes 3 positional arguments but 50 were given
If I run the script after modifying this portion to link_list = [url for url in get_links(s,URL)][0], I get the following error:
TypeError: zip argument #1 must support iteration
How can I successfully execute the above script keeping the existing design intact?
Because fetch takes 3 arguments (s, name, url), you need to pass 3 iterables to executor.map().
When you do this:
executor.map(fetch, *link_list)
*link_list unpacks into 49 or so tuples, each with 3 elements (the Session object, username, and URL), and each tuple gets passed to executor.map() as a separate iterable argument. That's not what you want.
What you need to do is first transform link_list into 3 separate iterables (one for the Session objects, another for the usernames, and one for the urls). Instead of doing this manually, you can use zip() and the unpacking operator twice, like so:
for result in executor.map(fetch, *zip(*link_list)):
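To see what that does, zip(*link_list) transposes the list of 3-tuples into three parallel iterables, which executor.map() then consumes in lockstep. A minimal illustration, with dummy values standing in for the real (session, user_name, post_link) tuples:
# dummy stand-ins for the (session, user_name, post_link) tuples
link_list = [('s', 'userA', 'urlA'), ('s', 'userB', 'urlB')]

sessions, names, urls = zip(*link_list)
print(sessions)  # ('s', 's')
print(names)     # ('userA', 'userB')
print(urls)      # ('urlA', 'urlB')

# executor.map(fetch, sessions, names, urls) then calls
# fetch('s', 'userA', 'urlA'), fetch('s', 'userB', 'urlB'), ...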
Also, when I tested your code, an exception was raised in get_links:
user_name = item.select_one(".user-details > a").get_text(strip=True)
AttributeError: 'NoneType' object has no attribute 'get_text'
item.select_one returned None, which obviously doesn't have a get_text() method, so I just wrapped that in a try/except block, catching AttributeError and continued the loop.
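In other words, get_links() becomes something like this (same logic as the original, just guarded):
def get_links(s, url):
    res = s.get(url)
    soup = BeautifulSoup(res.text, "lxml")
    for item in soup.select(".summary"):
        try:
            user_name = item.select_one(".user-details > a").get_text(strip=True)
            post_link = urljoin(base, item.select_one(".question-hyperlink").get("href"))
        except AttributeError:
            # select_one() returned None for this item; skip it
            continue
        yield s, user_name, post_link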
Also note that Requests' Session class isn't thread-safe. Luckily, the script returned sane responses when I ran it, but if you need your script to be reliable, you need to address this. A comment in the second link below shows how to use one Session instance per thread thanks to thread-local data. See:
Document threading contract for Session class
Thread-safety of FutureSession
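For completeness, the thread-local approach from those links looks roughly like this (a sketch, assuming one Session per worker thread is acceptable for your use case):
import threading
import requests

thread_local = threading.local()

def get_session():
    # each worker thread lazily creates, then keeps reusing, its own Session
    if not hasattr(thread_local, "session"):
        thread_local.session = requests.Session()
    return thread_local.session

# then inside fetch(), use the per-thread session instead of the shared one:
# res = get_session().get(url)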
Is there a way to write a step that works for multiple keywords? For example, say my feature is:
Scenario: Something happens after navigating
  Given I navigate to "/"
  And say some cookie gets set
  When I navigate to "/some-other-page"
  Then something happens because of that cookie
I'm trying to avoid having to define both:
@given('I navigate to "{uri}"')
def get(context, uri):
    current_url = BASE_URL + uri
    context.driver.get(current_url)

@when('I navigate to "{uri}"')
def get(context, uri):
    current_url = BASE_URL + uri
    context.driver.get(current_url)
If you only define one and try to use it as both, you get a raise NotImplementedError(u'STEP: ...) error. With the above example it's not that bad because the step is simple, but it seems like bad practice to repeat code, and the same thing could happen with something more complicated. To me it seems like it would make sense if there were something like an @all or @any keyword.
Apologies if this has been answered somewhere, but it's hard to find unique search terms for this type of question.
It turns out this can be done using @step, e.g.:
from behave import step

@step('I navigate to "{uri}"')
def step_impl(context, uri):
    current_url = BASE_URL + uri
    context.driver.get(current_url)
This will work for:
Scenario: Demo how @step can be used for multiple keywords
  Given I navigate to "/"
  When I navigate to "/"
  Then I navigate to "/"
Note: Figured this out from the ticket which led to this file.
If you do not want to use @step, you can also stack the individual decorators (behave has no @and decorator; And/But steps take on the keyword of the step that precedes them):
@then(u'I navigate to "{uri}"')
@when(u'I navigate to "{uri}"')
@given(u'I navigate to "{uri}"')
def get(context, uri):
    current_url = BASE_URL + uri
    context.driver.get(current_url)
There are conceptual differences between what the @given, @when and @then decorators stand for. You have situations where a step applies to all of them, to some of them, or to only one.
It is all too easy to use @step when the step only applies to two situations, and then rely on the test writers to only use the step in the right circumstances. I would encourage people not to do that. Use the following example instead.
@given('step text')
@when('step text')
def step_impl(context):
    pass
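Applied to the navigation step from the question, that looks like:
from behave import given, when

@given('I navigate to "{uri}"')
@when('I navigate to "{uri}"')
def step_impl(context, uri):
    current_url = BASE_URL + uri
    context.driver.get(current_url)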
However, when a step is truly applicable to all cases (a good example is a delay step), use the @step decorator:
@step('delay {duration} second(s)')
def step_impl(context, duration):
    time.sleep(float(duration))
You can try something like this:
As a web user
Given I navigate to "/" and say some cookie gets set
Then I navigate to "/some-other-page"
And something happens because of that cookie
It's working for me. When you write an "And" statement following a "Then" statement, it is treated as two "Then" statements. Also, you should include the u'' prefix inside the given and then decorators' parentheses.
Try as below:
@given(u'I navigate to "{uri}"')
def get(context, uri):
    current_url = BASE_URL + uri
    context.driver.get(current_url)

@given(u'say some cookie gets set')
def set_cookie(context):
    # your code
    pass

@then(u'I navigate to "/some-other-page"')
def step_impl(context):
    # your code
    pass

@then(u'something happens because of that cookie')
def step_impl(context):
    # your code
    pass
I don't really have any idea about this, so I'd like some advice if you can give it.
Generally when I use Selenium I search for the element I'm interested in, but now I'm thinking of developing some kind of performance test to check how much time a specific webpage (HTML, scripts, etc.) takes to load.
Do you have any idea how to measure the load time of the HTML, scripts, etc. without searching for a specific element of the page?
PS: I use IE or Firefox.
You could check the underlying JavaScript framework for active connections. When there are no active connections, you can assume the page has finished loading.
That, however, requires that you either know which framework the page uses, or systematically check for different frameworks and then check for connections.
import logging
import time

def get_js_framework(driver):
    frameworks = [
        'return jQuery.active',
        'return Ajax.activeRequestCount',
        'return dojo.io.XMLHTTPTransport.inFlight.length'
    ]
    for f in frameworks:
        try:
            driver.execute_script(f)
        except Exception:
            logging.debug("{0} didn't work, trying next js framework".format(f))
            continue
        else:
            return f
    else:
        return None

def load_page(driver, link):
    timeout = 5
    begin = time.time()
    driver.get(link)
    js = get_js_framework(driver)
    if js:
        while driver.execute_script(js) and time.time() < begin + timeout:
            time.sleep(0.25)
    else:
        time.sleep(timeout)
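To turn that into an actual load-time number, you could time the call yourself; something like this (a sketch using the helpers above, with a placeholder URL and an existing webdriver instance assumed):
import time

start = time.time()
load_page(driver, "http://example.com/")  # placeholder URL; `driver` is an existing webdriver instance
elapsed = time.time() - start
print("Page finished loading in {:.2f} seconds".format(elapsed))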
I'm building a web crawler. Some of the data I put into the datastore gets saved, other data does not, and I have no idea what the problem is.
Here is my crawler class:
class Crawler(object):
    def get_page(self, url):
        try:
            req = urllib2.Request(url, headers={'User-Agent': "Magic Browser"})  # yessss!!! with the header, I am able to download pages
            #response = urlfetch.fetch(url, method='GET')
            #return response.content
            #except urlfetch.InvalidURLError as iu:
            #    return iu.message
            response = urllib2.urlopen(req)
            return response.read()
        except urllib2.HTTPError as e:
            return e.reason

    def get_all_links(self, page):
        return re.findall('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', page)

    def union(self, lyst1, lyst2):
        try:
            for elmt in lyst2:
                if elmt not in lyst1:
                    lyst1.append(elmt)
            return lyst1
        except e:
            return e.reason

    #function that crawls the web for links starting from the seed
    #returns a dictionary of index and graph
    def crawl_web(self, seed="http://tonaton.com/"):
        query = Listings.query()  #create a listings object from storage
        if query.get():
            objListing = query.get()
        else:
            objListing = Listings()
            objListing.toCrawl = [seed]
            objListing.Crawled = []
        start_time = datetime.datetime.now()
        while datetime.datetime.now() - start_time < datetime.timedelta(0, 5):  #tocrawl (to crawl can take forever)
            try:
                #while True:
                page = objListing.toCrawl.pop()
                if page not in objListing.Crawled:
                    content = self.get_page(page)
                    add_page_to_index(page, content)
                    outlinks = self.get_all_links(content)
                    graph = Graph()  #create a graph object with the url
                    graph.url = page
                    graph.links = outlinks  #save all outlinks as the value part of the graph url
                    graph.put()
                    self.union(objListing.toCrawl, outlinks)
                    objListing.Crawled.append(page)
            except:
                return False
        objListing.put()  #save to database
        return True  #return true if it works
The classes that define the various ndb models are in this Python module:
import os
import urllib
from google.appengine.ext import ndb
import webapp2

class Listings(ndb.Model):
    toCrawl = ndb.StringProperty(repeated=True)
    Crawled = ndb.StringProperty(repeated=True)

#let's see how this works
class Index(ndb.Model):
    keyword = ndb.StringProperty()  # keyword part of the index
    url = ndb.StringProperty(repeated=True)  # value part of the index

#class Links(ndb.Model):
#    links = ndb.JsonProperty(indexed=True)

class Graph(ndb.Model):
    url = ndb.StringProperty()
    links = ndb.StringProperty(repeated=True)
It used to work fine when I had JsonProperty in place of StringProperty(repeated=True), but JsonProperty is limited to 1500 bytes, so I got an error once.
Now, when I run the crawl_web member function, it actually crawls, but when I check the datastore only the Index entity is created: no Graph, no Listings. Please help. Thanks.
Putting your code together, adding the missing imports, and logging the exception, eventually shows the first killer problem:
Exception Indexed value links must be at most 500 characters
and indeed, adding a logging of outlinks, one easily eyeballs that several of them are far longer than 500 characters -- therefore they can't be items in an indexed property, such as a StringProperty. Changing each repeated StringProperty to a repeated TextProperty (so it does not get indexed and thus has no 500-characters-per-item limitation), the code runs for a while (making a few instances of Graph) but eventually dies with:
An error occured while connecting to the server: Unable to fetch URL: https://sb':'http://b')+'.scorecardresearch.com/beacon.js';document.getElementsByTagName('head')[0].appendChild(s); Error: [Errno 8] nodename nor servname provided, or not known
and indeed, it's pretty obvious that the alleged "link" is actually a bunch of Javascript and as such cannot be fetched.
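For reference, that first fix amounts to swapping the property types in the models from the question, along these lines (a sketch; the ndb import is as in the question's module):
class Listings(ndb.Model):
    toCrawl = ndb.TextProperty(repeated=True)  # TextProperty is unindexed, so no 500-char-per-item limit
    Crawled = ndb.TextProperty(repeated=True)

class Graph(ndb.Model):
    url = ndb.TextProperty()
    links = ndb.TextProperty(repeated=True)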
So, essentially, the core bug in your code is not at all related to app engine, but rather, the issue is that your regular expression:
'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
does not properly extract outgoing links given a web page containing Javascript as well as HTML.
There are many issues with your code, but to this point they're just slowing it down or making it harder to understand, not killing it -- what's killing it is using that regular expression pattern to try and extract links from the page.
Check out retrieve links from web page using python and BeautifulSoup -- most answers suggest using BeautifulSoup for the purpose of extracting links from a page, which may perhaps be a problem in App Engine, but one shows how to do it with just Python and REs.
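The gist of the RE-based approach is to only pull URLs out of href attributes rather than matching anything URL-shaped anywhere in the page. A rough sketch of what a drop-in replacement for the Crawler method could look like (the pattern is illustrative, not bulletproof):
import re

HREF_RE = re.compile(r'href=["\'](https?://[^"\'>]+)["\']', re.IGNORECASE)

def get_all_links(self, page):
    # only pick up URLs that appear as href attribute values
    return HREF_RE.findall(page)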
I'm trying to pass some parameters to the link_clicks and link_countries bit.ly APIs with Python, but I'm not sure of the syntax for passing the parameters here. How can I add parameters to this API call?
import sys
import bitly_api
import os
from config import config

#connect to bitly
conn_btly = bitly_api.Connection(access_token=config['ACCESS_TOKEN'])

#get links
links = conn_btly.user_link_history()
print 'links okay'

for link in links:
    #add params to link
    link_full = link['link'] + '?rollup=false'
    print link_full
    #get clicks
    clicks = conn_btly.link_clicks(link_full)
    #print results
    #print link['link'], clicks
    print clicks
The resulting output is:
links okay
http://mzl.la/19xSyCT?rollup=false
...
BitlyError: NOT_FOUND
You need to pass in rollup as a keyword parameter instead:
clicks = conn_btly.link_clicks(link['link'], rollup=False)
You are expected to pass in a Python boolean value. The parameter is not part of the bit.ly URL; it is a parameter to the API call instead.
All optional API parameters (so, everything apart from the link) are passed in as keyword parameters, including unit, units, tz_offset and limit.
You can take a look at the internal method that handles these parameters if you are so inclined.
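For example, something along these lines should work (the parameter values here are purely illustrative):
#clicks for the last 7 days, per day, without rollup (values are illustrative)
clicks = conn_btly.link_clicks(link['link'], rollup=False, unit='day', units=7)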