Python OOP - web session - python

I have the following script:

    import mechanize, cookielib, re
    # ...
    br = mechanize.Browser()
    cj = cookielib.LWPCookieJar()
    br.set_cookiejar(cj)
    br.addheaders = ....
    # ... and do stuff
Because my script is growing very big, I want to split it into classes: one class to handle the web connection, one class to do stuff, and so on.
From what I read, I need something like:

    from web_session import *  # this is my class handling web-connection (cookies + auth)
    from do_stuff import *     # I do stuff on web pages

and in my main, I have:

    browser = Web_session()
    stuff = Do_stuff()

The problem for me is that I lose the session cookies when I pass them to Do_stuff. Can anyone help me with a basic example of classes and their interaction? Let's say: I log in on a site, I browse a page, and I want to do something like re.findall("something", one_that_page). Thanks in advance.
Update:
Main script:

    br = WebBrowser()
    br.login(myId, myPass)

WebBrowser class:

    class WebBrowser():
        def __init__(self):
            self.browser = mechanize.Browser()
            cj = cookielib.LWPCookieJar()
            self.browser.set_cookiejar(cj)
            self.browser.addheaders = ....

        def login(self, username, password):
            self.username = username
            self.password = password
            self.browser.open(some site)
            self.browser.submit(username, password)

        def open(self, url):
            self.url = url
            self.browser.open(url)

        def read(self, url):
            self.url = url
            page = self.browser.open(url).read()
            return page
Current state:
This part works perfectly, I can log in, but I lose the mechanize "goodies" like open, post or read a URL.
For example:

    management = br.read("some_url.php")

all my cookies are gone (error: must be logged in).
How can I fix it?

The "mechanise.Browser" class has all the functionality it seens you want to put on your "Web_session" class (side note - naming conventions and readility would recomend "WebSession" instead).
Anyway, you will retain you cookies if you keep the same Browser object across calls - if you really go for having another wrapper class, just create a mehcanize.Broser when instantiating your Web_session class, and keep that as an object attribute (for example, as "self.browser") .
But, you most likelly don't need to do that - just create a Browser on the __init__ of your Do_stuff, keep it as an instance attribute, and reuse it for all requests -
    import re
    import urllib
    import cookielib
    import mechanize

    class DoStuff(object):
        def __init__(self):
            self.browser = mechanize.Browser()
            cj = cookielib.LWPCookieJar()
            self.browser.set_cookiejar(cj)

        def login(self, login_url, credentials):
            # POST the credentials dict to the login page;
            # the cookies set by the response stay in the cookie jar above
            self.browser.open(login_url, data=urllib.urlencode(credentials))

        def match_text_at_page(self, url, text):
            # this request sends the same cookies as were set by the login
            page = self.browser.open(url).read()
            return re.findall(text, page)
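
A minimal usage sketch under the same assumptions (the login URL, the credential field names and the page URL below are placeholders, not taken from the question):

    stuff = DoStuff()
    stuff.login("http://example.com/login.php", {"user": "myId", "pass": "myPass"})
    # later requests go through the same Browser, so the login cookies are reused
    management = stuff.match_text_at_page("http://example.com/some_url.php", "something")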

Never use the construct from X import * as in:

    from web_session import *
    from do_stuff import *
It's ok when you are experimenting in an interactive session, but don't use it in your code.
Imagine the following: in web_session.py you have a function called my_function, which you use in your main module. In do_stuff.py you have an import statement from some_lib_I_found_on_the_net import *. Everything is nice, but after a while your program mysteriously fails. It turns out that you upgraded some_lib_I_found_on_the_net.py, and the new version contained a function called my_function. Your main program is suddenly calling some_lib_I_found_on_the_net.my_function instead of web_session.my_function. Python has such nice support for separating concerns, but with that lazy construct you'll just shoot yourself in the foot, and besides, it's so nice to be able to look at your code and see where every object comes from, which you can't do with the *.
If you want to avoid long names like web_session.my_function(), do either import web_session as ws and then ws.my_function(), or from web_session import my_function, ...
Even if you only import one single module in this way, it can bite you. I had colleagues who had something like...
    ...
    import util
    ...
    from matplotlib import *
    ...
    (a few hundred lines of code)
    ...
    x = util.some_function()
    ...
Suddenly, they got an AttributeError on the call to util.some_function, which had worked like a charm for years. However much they looked at the code, they couldn't understand what was wrong. It took a long time before someone realized that matplotlib had been upgraded, and now it contained a function called (you guessed it) util!
Explicit is better than implicit!
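
For completeness, a small sketch of the two explicit alternatives mentioned above (using the web_session and my_function names from the example):

    # alias the module and keep the namespace visible at every call site
    import web_session as ws
    ws.my_function()

    # or import only the names you actually use
    from web_session import my_function
    my_function()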

Related

Unable to use session within a classmethod of a web-scraper

I've created a Python script using a classmethod to fetch the profile name after logging in by submitting the credentials to a webpage. The script is able to fetch the profile name in the right way. What I wish to do now is use the session within the classmethod. The session has already been defined within the __init__() method. I would like to keep the existing design intact.
This is what I've tried so far:
    import requests
    from bs4 import BeautifulSoup

    class StackOverflow:
        SEARCH_URL = "https://stackoverflow.com/users/login?ssrc=head&returnurl=https%3a%2f%2fstackoverflow.com%2f"

        def __init__(self, session):
            self.session = session

        @classmethod
        def crawl(cls, email, password):
            page = requests.get(cls.SEARCH_URL, headers={"User-Agent": "Mozilla/5.0"})
            sauce = BeautifulSoup(page.text, "lxml")
            fkey = sauce.select_one("[name='fkey']")["value"]
            payload = {"fkey": fkey, "email": email, "password": password}
            res = requests.post(cls.SEARCH_URL, data=payload, headers={"User-Agent": "Mozilla/5.0"})
            soup = BeautifulSoup(res.text, "lxml")
            user = soup.select_one("div[class^='gravatar-wrapper-']").get("title")
            yield user

    if __name__ == '__main__':
        with requests.Session() as s:
            result = StackOverflow(s)
            for item in result.crawl("email", "password"):
                print(item)
How can I use the session from __init__ within the classmethod?
You can't access self.session from a class method. The __init__ method is called when an instance of the class is created, but class methods are not bound to any particular instance of the class, only to the class itself - that's why the first parameter is usually cls and not self.
You decided to create the session in __init__, so it can be assumed that

    so1 = StackOverflow()
    so2 = StackOverflow()

keep their sessions separate. If that is indeed your intention, the crawl method should not be annotated with @classmethod. If you have crawl(self, email, password): then you will still be able to use StackOverflow.SEARCH_URL or self.__class__.SEARCH_URL to get the value defined on the StackOverflow class, or self.SEARCH_URL, which by default gets the same value but could be changed with so1.SEARCH_URL = "sth else" (while so2.SEARCH_URL would keep its original value).
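
A minimal sketch of that change, reusing the question's own selectors and URL (not re-verified here): crawl becomes an instance method and every request goes through self.session, so the login cookies set by the POST are reused by later calls.

    def crawl(self, email, password):
        page = self.session.get(self.SEARCH_URL, headers={"User-Agent": "Mozilla/5.0"})
        sauce = BeautifulSoup(page.text, "lxml")
        fkey = sauce.select_one("[name='fkey']")["value"]
        payload = {"fkey": fkey, "email": email, "password": password}
        # the POST stores the auth cookies on self.session
        res = self.session.post(self.SEARCH_URL, data=payload, headers={"User-Agent": "Mozilla/5.0"})
        soup = BeautifulSoup(res.text, "lxml")
        user = soup.select_one("div[class^='gravatar-wrapper-']").get("title")
        yield user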

Unit test mock urllib.request.urlretrieve() Python 3 and internal function

How can I mock or unit test a function/method that uses urllib.request.urlretrieve to save a file?
This is the part of the code that I'm trying to test:

    from urllib import request, error
    from config import cvs_site, proxy

    class Collector(object):
        """Class Collector"""
        ...
        def __init__(self, code_num=""):
            self.code_num = code_num.upper()
            self.csv_file = "csv_files/code_num.csv"

            # load proxy if it is configured
            if proxy:
                proxies = {"http": proxy, "https": proxy, "ftp": proxy}
                proxy_connect = request.ProxyHandler(proxies)
                opener = request.build_opener(proxy_connect)
                request.install_opener(opener)

            def _collect_data():
                try:
                    print("\nAccessing to retrieve CVS information.")
                    request.urlretrieve(cvs_site, self.csv_file)
                except error.URLError as e:
                    exit("\033[1;31m[ERROR]\033[1;00m {0}\n".format(e))
            ...

        def some_function(self):
            _collect_data()
            ...
Should I test all internal functions (_functions())?
How do I mock it?
To solve it, I made some modifications to my code and then created the test with mock.
REMARK: I'm still learning unit tests and mock, so any further comments here are welcome because I'm not confident that I went the correct way :)
The function _collect_data() does not need to be inside __init__(), so I moved it outside.
_collect_data is a function with a specific action (saving the file), but it needs to return something so that mock can work with it.
The argument was moved from the class to the function.
The new code looks like this:

    from config import proxy
    from config import cvs_site
    from urllib import request, error

    class Collector(object):
        """Class Collector"""

        def __init__(self):
            self.csv_file = "csv_files/code_num.csv"

            # load proxy if it is configured
            if proxy:
                proxies = {"http": proxy, "https": proxy, "ftp": proxy}
                proxy_connect = request.ProxyHandler(proxies)
                opener = request.build_opener(proxy_connect)
                request.install_opener(opener)

        def _collect_data(self):
            try:
                print("\nAccessing to retrieve CVS information.")
                return request.urlretrieve(cvs_site, self.csv_file)
            except error.URLError as e:
                return "\033[1;31m[ERROR]\033[1;00m {0}\n".format(e)
        ...

        def some_function(self, code_num=""):
            code_num = code_num.upper()
            self._collect_data()
            ...
To test this code I created this:

    import unittest
    import mock
    from saassist.datacollector import Collector

    class TestCollector(unittest.TestCase):

        def setUp(self):
            self.apar_test = Collector()

        @mock.patch("saassist.datacollector.request")
        def test_collect_data(self, mock_collect_data):
            mock_collect_data.urlretrieve.return_value = "File Collected OK"
            self.assertEqual("File Collected OK", self.apar_test._collect_data())

Well, I don't know what more could be tested here, but as a start I think it is good :)
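
As a possible follow-up (my own sketch, assuming the saassist.datacollector layout used above), the same patch also lets a test inside TestCollector assert that urlretrieve was called with the csv_file configured in __init__:

    @mock.patch("saassist.datacollector.request")
    def test_collect_data_target_file(self, mock_request):
        mock_request.urlretrieve.return_value = "File Collected OK"
        self.apar_test._collect_data()
        # the download should be attempted against the configured csv_file
        mock_request.urlretrieve.assert_called_once_with(
            mock.ANY, self.apar_test.csv_file)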

Scraping with the normally-fast urllib2 slowed by a number of factors - what are they?

I usually write function-only Python programs, but have decided on an OOD approach (my first) for my current program, a web scraper:
    import csv
    import urllib2

    NO_VACANCIES = ['no vacancies', 'not hiring']

    class Page(object):
        def __init__(self, url):
            self.url = url

        def get_source(self):
            self.source = urllib2.urlopen(self.url).read()
            return self.source

    class HubPage(Page):
        def has_vacancies(self):
            return not(any(text for text in NO_VACANCIES if text in self.source.lower()))

    urls = []
    with open('25.csv', 'rb') as spreadsheet:
        reader = csv.reader(spreadsheet)
        for row in reader:
            urls.append(row[0].strip())

    for url in urls:
        page = HubPage(url)
        source = page.get_source()
        if page.has_vacancies():
            print 'Has vacancies'
Some context: HubPage represents a typical 'jobs' page on a company's web site. I am subclassing Page because I will eventually subclass it again for individual job pages, and some methods will be used only to extract data from individual job pages (this may be overkill).
Here's my issue: I know from experience that urllib2, while it has its critics, is fast - very fast - at doing what it does, namely fetch a page's source. Yet I notice that in my design, processing of each url is taking a few orders of magnitude longer than what I typically observe.
Is it the fact that class instantiations are involved (unnecessarily, perhaps)?
Might the fact that HubPage is inherited be the cause?
Is the call to any() known to be expensive when it contains a generator expression such as it does here?
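
One way to check where the time actually goes (my own diagnostic sketch, not part of the original post) is to time the raw urllib2 fetch separately from the full class-based path for a single URL; if the two numbers are close, the class machinery and inheritance are not the bottleneck:

    import time
    import urllib2

    test_url = urls[0]  # reuse one URL from the spreadsheet loaded above

    start = time.time()
    urllib2.urlopen(test_url).read()          # bare fetch, no classes involved
    print 'bare urllib2 fetch: %.2fs' % (time.time() - start)

    start = time.time()
    page = HubPage(test_url)                  # instantiation + fetch + check
    page.get_source()
    page.has_vacancies()
    print 'class-based path:   %.2fs' % (time.time() - start)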

Retrieving Twitter data on the fly

Our company is trying to read in all live streams of data entered by random users, i.e., a random user sends off a tweet saying "ABC company".
Seeing as you could use a Twitter client to search for said text, I labour under the assumption that it's possible to aggregate all such tweets without using a client, i.e., to a file, streaming in live without using hashtags.
What's the best way to do this? If you've done this before, could you share your script? I reckon the simplest way would be a ruby/python script left running, but my understanding of ruby/python is limited at best.
Kindly help?
Here's a bare minimum:
    #!/usr/bin/python
    # -*- coding: utf-8 -*-

    import twitter
    from threading import *
    from os import _exit, urandom
    from time import sleep
    from logger import *
    import unicodedata

    ## Based on: https://github.com/sixohsix/twitter
    class twitt(Thread):
        def __init__(self, tags = None, *args, **kwargs):
            self.consumer_key = '...'
            self.consumer_secret = '...'
            self.access_key = '...'
            self.access_secret = '...'
            self.encoding = 'iso-8859-15'
            self.args = args
            self.kwargs = kwargs
            self.searchapi = twitter.Twitter(domain="search.twitter.com").search
            Thread.__init__(self)
            self.start()

        def search(self, tag):
            try:
                return self.searchapi(q=tag)['results']
            except:
                return {}

        def run(self):
            while 1:
                sleep(3)
To use it, do something like:
    if __name__ == "__main__":
        t = twitt()
        print t.search('#DHSupport')
        t.alive = False
Note: the only reason this is threaded is that it's just a piece of code I had lying around from other projects; it gives you an idea of how to work with the API and perhaps build a background service to fetch search results on Twitter.
There's a lot of crap in my original code, so the structure might look a bit odd.
Note that you don't really need the consumer keys etc. for just a search, but you will need an OAuth login for more features such as posting or checking messages (see the sketch below).
The only two things you really need are:

    import twitter
    print twitter.Twitter(domain="search.twitter.com").search(q='#hashtag')['results']
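
If you do need those authenticated features, a rough sketch with the same sixohsix twitter library would pass an OAuth object built from the four tokens above (this is my own illustration and is not verified against the current Twitter API):

    import twitter

    auth = twitter.OAuth('access_key', 'access_secret',
                         'consumer_key', 'consumer_secret')
    t = twitter.Twitter(auth=auth)
    # an authenticated call, e.g. posting a status update
    t.statuses.update(status='Hello from the API')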

Spynner crash python

I'm building a Django app and I'm using Spynner for web crawling. I have this problem and I hope someone can help me.
I have this function in the module "crawler.py":
    import spynner

    def crawling_js(url):
        br = spynner.Browser()
        br.load(url)
        text_page = br.html
        br.close  # (*)
        return text_page
(*) I tried with br.close() too
In another module (e.g. "import.py") I call the function in this way:

    from crawler import crawling_js

    l_url = ["https://www.google.com/", "https://www.tripadvisor.com/", ...]
    for url in l_url:
        mytextpage = crawling_js(url)
        # .. parse mytextpage ...

When I pass the first URL to the function everything is correct; when I pass the second URL, Python crashes. It crashes on this line: br.load(url). Can someone help me? Thanks a lot.
I have:
Django 1.3
Python 2.7
Spynner 1.1.0
PyQt4 4.9.1
Why do you need to instantiate br = spynner.Browser() and close it every time you call crawling_js()? In a loop this will use a lot of resources, which I think is the reason why it crashes. Let's think of it like this: br is a browser instance, so you can make it browse any number of websites without needing to close it and open it again. Adjust your code this way:
    import spynner

    br = spynner.Browser()  # you open it only once

    def crawling_js(url):
        br.load(url)
        text_page = br._get_html()  # _get_html() to make sure you get the updated html
        return text_page
Then, if you insist on closing br later, you simply do:

    from crawler import crawling_js, br

    l_url = ["https://www.google.com/", "https://www.tripadvisor.com/", ...]
    for url in l_url:
        mytextpage = crawling_js(url)
        # .. parse mytextpage ...
    br.close()
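
A small variation on the same idea (my own sketch, not from the answer above): wrapping the loop in try/finally guarantees that the single browser instance is closed even if parsing one of the pages raises an exception.

    from crawler import crawling_js, br

    l_url = ["https://www.google.com/", "https://www.tripadvisor.com/"]
    try:
        for url in l_url:
            mytextpage = crawling_js(url)
            # .. parse mytextpage ...
    finally:
        br.close()  # runs whether or not the loop completed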
