How can I mock or unit test a function/method that uses urllib.request.urlretrieve to save a file?
This is the part of the code I'm trying to test:
from urllib import error, request

from config import cvs_site, proxy


class Collector(object):
    """Class Collector"""
    ...
    def __init__(self, code_num=""):
        self.code_num = code_num.upper()
        self.csv_file = "csv_files/code_num.csv"

        # load proxy if it is configured
        if proxy:
            proxies = {"http": proxy, "https": proxy, "ftp": proxy}
            proxy_connect = request.ProxyHandler(proxies)
            opener = request.build_opener(proxy_connect)
            request.install_opener(opener)

        def _collect_data():
            try:
                print("\nAccessing site to retrieve CSV information.")
                request.urlretrieve(cvs_site, self.csv_file)
            except error.URLError as e:
                exit("\033[1;31m[ERROR]\033[1;00m {0}\n".format(e))
    ...
    def some_function(self):
        _collect_data()
    ...
Should I test all the internal functions (the _functions())?
How do I mock this?
To solve this I made some modifications to my code, and then created the test with mock.
REMARK: I'm still learning unit tests and mock, so any comments here are welcome; I'm not confident that I went the correct way :)
The function _collect_data() has no need to be inside __init__(), so I moved it out.
_collect_data performs one specific action (saving the file), but it needs to return something so that mock can work with it.
The code_num argument was moved from the class to the function.
The new code looks like this:
from urllib import error, request

from config import cvs_site, proxy


class Collector(object):
    """Class Collector"""

    def __init__(self):
        self.csv_file = "csv_files/code_num.csv"

        # load proxy if it is configured
        if proxy:
            proxies = {"http": proxy, "https": proxy, "ftp": proxy}
            proxy_connect = request.ProxyHandler(proxies)
            opener = request.build_opener(proxy_connect)
            request.install_opener(opener)

    def _collect_data(self):
        try:
            print("\nAccessing site to retrieve CSV information.")
            return request.urlretrieve(cvs_site, self.csv_file)
        except error.URLError as e:
            return "\033[1;31m[ERROR]\033[1;00m {0}\n".format(e)
    ...
    def some_function(self, code_num=""):
        code_num = code_num.upper()
        self._collect_data()
    ...
To test this code I created this:
import unittest

import mock

from saassist.datacollector import Collector


class TestCollector(unittest.TestCase):

    def setUp(self):
        self.apar_test = Collector()

    @mock.patch("saassist.datacollector.request")
    def test_collect_data(self, mock_collect_data):
        mock_collect_data.urlretrieve.return_value = "File Collected OK"
        self.assertEqual("File Collected OK", self.apar_test._collect_data())
Well, I don't know what more could be tested here, but as a start I think it is good :)
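One extra case that might be worth covering is the error branch: if urlretrieve raises, _collect_data should return the formatted error string. A minimal sketch, assuming the same module layout as above (the test class name and the error message are illustrative):

import unittest

import mock
from urllib import error

from saassist.datacollector import Collector


class TestCollectorErrors(unittest.TestCase):

    @mock.patch("saassist.datacollector.request")
    def test_collect_data_url_error(self, mock_request):
        # simulate a failed download by making the patched urlretrieve raise
        mock_request.urlretrieve.side_effect = error.URLError("host unreachable")
        result = Collector()._collect_data()
        self.assertIn("[ERROR]", result)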
I've created a Python script using a classmethod to fetch the profile name after logging in by entering the credentials on a webpage. The script is able to fetch the profile name correctly. What I wish to do now is use the session within the classmethod. The session has already been defined within the __init__() method. I would like to keep the existing design intact.
This is what I've tried so far:
import requests
from bs4 import BeautifulSoup


class StackOverflow:
    SEARCH_URL = "https://stackoverflow.com/users/login?ssrc=head&returnurl=https%3a%2f%2fstackoverflow.com%2f"

    def __init__(self, session):
        self.session = session

    @classmethod
    def crawl(cls, email, password):
        page = requests.get(cls.SEARCH_URL, headers={"User-Agent": "Mozilla/5.0"})
        sauce = BeautifulSoup(page.text, "lxml")
        fkey = sauce.select_one("[name='fkey']")["value"]
        payload = {"fkey": fkey, "email": email, "password": password}
        res = requests.post(cls.SEARCH_URL, data=payload, headers={"User-Agent": "Mozilla/5.0"})
        soup = BeautifulSoup(res.text, "lxml")
        user = soup.select_one("div[class^='gravatar-wrapper-']").get("title")
        yield user


if __name__ == '__main__':
    with requests.Session() as s:
        result = StackOverflow(s)
        for item in result.crawl("email", "password"):
            print(item)
How can I use the session from __init__ within the classmethod?
You can't access self.session from a class method. The __init__ method is called when an instance of the class is created; class methods, however, are not bound to any particular instance of the class but to the class itself - that's why their first parameter is conventionally cls and not self.
You decided to create the session in __init__, so it can be assumed that
so1 = StackOverflow(requests.Session())
so2 = StackOverflow(requests.Session())
keep their sessions separate. If that is indeed your intention, the crawl method should not be decorated with @classmethod. If you have crawl(self, email, password): then you will still be able to use StackOverflow.SEARCH_URL or self.__class__.SEARCH_URL to get the value defined in the StackOverflow class, or self.SEARCH_URL, which by default resolves to the same value but could be changed per instance with so1.SEARCH_URL = "sth else" (so2.SEARCH_URL would keep its original value).
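For illustration, a minimal sketch of crawl rewritten as an instance method that reuses the session from __init__ (the URL, selectors, and payload come from the question; setting the User-Agent on the session is an added assumption so it applies to every request):

import requests
from bs4 import BeautifulSoup


class StackOverflow:
    SEARCH_URL = "https://stackoverflow.com/users/login?ssrc=head&returnurl=https%3a%2f%2fstackoverflow.com%2f"

    def __init__(self, session):
        self.session = session
        self.session.headers["User-Agent"] = "Mozilla/5.0"

    def crawl(self, email, password):
        # every request goes through self.session, so cookies persist between calls
        page = self.session.get(self.SEARCH_URL)
        fkey = BeautifulSoup(page.text, "lxml").select_one("[name='fkey']")["value"]
        payload = {"fkey": fkey, "email": email, "password": password}
        res = self.session.post(self.SEARCH_URL, data=payload)
        soup = BeautifulSoup(res.text, "lxml")
        yield soup.select_one("div[class^='gravatar-wrapper-']").get("title")


if __name__ == '__main__':
    with requests.Session() as s:
        for item in StackOverflow(s).crawl("email", "password"):
            print(item)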
This is a program that takes multiple URLs as input and is called via the URL localhost:8888/api/v1/crawler.
The program takes over an hour to run, which is OK, but it blocks the other APIs: while it is running, no other API will work until it finishes. I want to run this program asynchronously; how can I achieve that with the same program?
@tornado.web.asynchronous
@gen.coroutine
@use_args(OrgTypeSchema)
def post(self, args):
    print "Enter In Crawler Match Script POST"
    print "Argsssss........"
    print args
    data = tornado.escape.json_decode(self.request.body)
    print "Data................"
    import json
    print json.dumps(data.get('urls'))
    from urllib import urlopen
    from bs4 import BeautifulSoup
    try:
        urls = json.dumps(data.get('urls'))
        urls = urls.split()
        import sys
        list = []
        # orig_stdout = sys.stdout
        # f = open('out.txt', 'w')
        # sys.stdout = f
        for url in urls:
            # print "FOFOFOFOFFOFO"
            # print url
            url = url.replace('"', " ")
            url = url.replace('[', " ")
            url = url.replace(']', " ")
            url = url.replace(',', " ")
            print "Final Url "
            print url
            try:
                site = urlopen(url) ..............
Your post method is 100% synchronous. You should make the site = urlopen(url) call async. There is an async HTTP client in Tornado for that; there is also a good example here.
You are using urllib, which is the reason for the blocking.
Tornado provides a non-blocking client called AsyncHTTPClient, which is what you should be using.
Use it like this:
from tornado.httpclient import AsyncHTTPClient


@gen.coroutine
@use_args(OrgTypeSchema)
def post(self, args):
    ...
    http_client = AsyncHTTPClient()
    site = yield http_client.fetch(url)
    ...
Another thing I'd like to point out: don't import modules from inside a function. Although that's not the reason for the blocking, it is still slower than putting all your imports at the top of the file. Read this question.
I have the following script:
import mechanize, cookielib, re ...
br = mechanize.Browser()
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)
br.addheaders = ....
and do stuff
Because my script is growing very big, I want to split it into classes: one class to handle the web connection, one class to do stuff, and so on.
From what I read, I need something like:
from web_session import *  # this is my class handling the web connection (cookies + auth)
from do_stuff import *  # I do stuff on web pages
and in my main, I have:
browser = Web_session()
stuff = Do_stuff()
The problem for me is that I lose the session cookies when I pass control to Do_stuff. Can anyone help me with a basic example of classes and their interaction? Let's say: I log in on a site, I browse a page, and I want to do something like re.findall("something", on_that_page). Thanks in advance.
Update:
Main Script:
br = WebBrowser()
br.login(myId, myPass)
WebBrowser class:
class WebBrowser():
    def __init__(self):
        self.browser = mechanize.Browser()
        cj = cookielib.LWPCookieJar()
        self.browser.set_cookiejar(cj)
        self.browser.addheaders = ....

    def login(self, username, password):
        self.username = username
        self.password = password
        self.browser.open(some site)
        self.browser.submit(username, password)

    def open(self, url):
        self.url = url
        self.browser.open(url)

    def read(self, url):
        self.url = url
        page = self.browser.open(url).read()
        return page
Current state:
This part works perfectly: I can log in, but I lose the mechanize class "goodies" like open, post or read on a URL.
For example:
management = br.read("some_url.php")
all my cookies are gone (error: must be logged in)
How can I fix it?
The "mechanise.Browser" class has all the functionality it seens you want to put on your "Web_session" class (side note - naming conventions and readility would recomend "WebSession" instead).
Anyway, you will retain you cookies if you keep the same Browser object across calls - if you really go for having another wrapper class, just create a mehcanize.Broser when instantiating your Web_session class, and keep that as an object attribute (for example, as "self.browser") .
But, you most likelly don't need to do that - just create a Browser on the __init__ of your Do_stuff, keep it as an instance attribute, and reuse it for all requests -
import re
import mechanize
import cookielib
from urllib import urlencode


class DoStuff(object):
    def __init__(self):
        self.browser = mechanize.Browser()
        cj = cookielib.LWPCookieJar()
        self.browser.set_cookiejar(cj)

    def login(self, login_url, credentials):
        # mechanize's open() performs a POST when it is given urlencoded data
        self.browser.open(login_url, data=urlencode(credentials))

    def match_text_at_page(self, url, text):
        # this will use the same cookies as were set by the login
        page = self.browser.open(url).read()
        return re.findall(text, page)
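A usage sketch (the URLs and credential field names here are placeholders, not part of the original question):

stuff = DoStuff()
stuff.login("http://example.com/login.php", {"username": "me", "password": "secret"})
# the cookies set during login are reused for this request
matches = stuff.match_text_at_page("http://example.com/some_page.php", "something")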
Never use the construct from X import * as in
from web_session import *
from do_stuff import *
It's ok when you are experimenting in an interactive session, but don't use it in your code.
Imagine the following: in web_session.py you have a function called my_function, which you use in your main module. In do_stuff.py you have an import statement from some_lib_I_found_on_the_net import *. Everything is nice, but after a while, your program mysteriously fails. It turns out that you upgraded some_lib_I_found_on_the_net.py, and the new version contained a function called my_function. Your main program is suddenly calling some_lib_I_found_on_the_net.my_function instead of web_session.my_function. Python has such nice support for separating concerns, but with that lazy construct you'll just shoot yourself in the foot. Besides, it's so nice to be able to look at your code and see where every object comes from, which you can't do with the *.
If you want to avoid long names like web_session.my_function(), do either import web_session as ws and then ws.my_function(), or from web_session import my_function, ...
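For illustration, the two explicit styles side by side (module and function names are the ones from the example above):

# Option 1: a short alias keeps the origin visible at every call site
import web_session as ws
ws.my_function()

# Option 2: import exactly the names you need, nothing more
from web_session import my_function
my_function()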
Even if you only import one single module in this way, it can bite you. I had colleagues who had something like...
...
import util
...
from matplotlib import *
...
(a few hundred lines of code)
...
x = util.some_function()
...
Suddenly, they got an AttributeError on the call to util.some_function, which had worked like a charm for years. However much they looked at the code, they couldn't understand what was wrong. It took a long time before someone realized that matplotlib had been upgraded, and now it contained a function called (you guessed it) util!
Explicit is better than implicit!
I want to test my web service (built on Tornado) using tornado.testing.AsyncHTTPTestCase. It says here that a POST with the AsyncHTTPClient should look like the following.
import unittest

from tornado.testing import AsyncHTTPTestCase
from urllib import urlencode

import app


class ApplicationTestCase(AsyncHTTPTestCase):
    def get_app(self):
        return app.Application()

    def test_file_uploading(self):
        url = '/'
        filepath = 'uploading_file.zip'  # Binary file
        data = ???????  # Read from "filepath" and put the generated something into "data"
        self.http_client.fetch(self.get_url(url),
                               self.stop,
                               method="POST",
                               body=urlencode(data))
        response = self.wait()
        self.assertEqual(response.code, 302)  # Do assertion


if __name__ == '__main__':
    unittest.main()
The problem is that I have no idea what to write at ???????. Are there any utility functions built into Tornado for this, or is it better to use an alternative library like Requests?
P.S.
... actually, I've tried using Requests, but my test stopped working, probably because I didn't handle the asynchronous tasking properly:
def test_file_uploading(self):
    url = '/'
    filepath = 'uploading_file.zip'  # Binary file
    files = {'file': open(filepath, 'rb')}
    r = requests.post(self.get_url(url), files=files)  # Freezes here
    self.assertEqual(r.status_code, 302)  # Do assertion
You need to construct a multipart/form-data request body. This is officially defined in the HTML spec. Tornado does not currently have any helper functions for generating a multipart body. However, you can use the MultipartEncoder class from the requests_toolbelt package. Just use the to_string() method instead of passing the encoder object directly to fetch().
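A sketch of how that could slot into the test case above (the field name 'file', the ZIP content type, and the 302 assertion are carried over from the question):

from requests_toolbelt import MultipartEncoder


def test_file_uploading(self):
    url = '/'
    filepath = 'uploading_file.zip'
    encoder = MultipartEncoder(
        fields={'file': ('uploading_file.zip', open(filepath, 'rb'), 'application/zip')}
    )
    # pass the encoded body and its boundary-bearing content type to fetch()
    self.http_client.fetch(self.get_url(url),
                           self.stop,
                           method="POST",
                           headers={'Content-Type': encoder.content_type},
                           body=encoder.to_string())
    response = self.wait()
    self.assertEqual(response.code, 302)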
I have this function:
def function_to_test(..):
    # some stuff
    response = requests.post("some.url", data={'dome': 'data'})
    # some stuff with response
I want to write a test, mocking out requests.post("some.url", data={'dome': 'data'}) because I know it works. I need something like:
def my_test():
    patch('requests.post', Mock(return_value={'some': 'values'})).start()
    # test for *function_to_test*
Is that possible? If so, how?
Edited
I found these ways:
class MockObject(object):
    status_code = 200
    content = {'some': 'data'}


# First way
patch('requests.post', Mock(return_value=MockObject())).start()

# Second way
import requests
requests.post = Mock(return_value=MockObject())
Are these approaches good? Which one is better? Is there another one?
You can use HTTPretty (https://github.com/gabrielfalcao/HTTPretty), a library made just for that kind of mocking.
You can use flexmock for that; see the example below.
Say you have this function in the file ./myfunc.py:
# ./myfunc.py
import requests


def function_to_test():
    # some stuff
    response = requests.post("www.google.com", data={'dome': 'data'})
    return response
Then the test case is in ./test_myfunc.py:
# ./test_myfunc.py
from flexmock import flexmock

import myfunc


def test_myfunc():
    (flexmock(myfunc.requests)
        .should_receive("post")
        .with_args("www.google.com", data={"dome": "data"})
        .and_return({'some': 'values'}))
    resp = myfunc.function_to_test()
    assert resp["some"] == "values"
Try this and see if it works, or let me know if it needs further enhancement/improvement.
After more and more testing, I now prefer this approach to mock requests. Hope it helps others.
from mock import Mock, patch


class MockObject(object):
    status_code = 200
    content = {'some': 'data'}


patch('requests.post', Mock(return_value=MockObject())).start()
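One caveat, as a sketch: a patch started with .start() stays active until .stop() is called, which can leak into other tests. Using patch as a context manager (or a decorator) undoes the patch automatically; the names below mirror the example above, and the test body is only indicated:

from mock import Mock, patch


class MockObject(object):
    status_code = 200
    content = {'some': 'data'}


def test_function_to_test():
    # the patch is undone automatically when the with-block exits
    with patch('requests.post', Mock(return_value=MockObject())):
        pass  # call function_to_test() here and assert on the result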