Unable to use session within a classmethod of a web-scraper - python

I've created a Python script using a classmethod to fetch the profile name after logging in by inputting the credentials in a webpage. The script is able to fetch the profile name in the right way. What I wish to do now is use the session within the classmethod. The session has already been defined within the __init__() method. I would like to keep the existing design intact.
This is what I've tried so far:
import requests
from bs4 import BeautifulSoup

class StackOverflow:
    SEARCH_URL = "https://stackoverflow.com/users/login?ssrc=head&returnurl=https%3a%2f%2fstackoverflow.com%2f"

    def __init__(self, session):
        self.session = session

    @classmethod
    def crawl(cls, email, password):
        page = requests.get(cls.SEARCH_URL, headers={"User-Agent": "Mozilla/5.0"})
        sauce = BeautifulSoup(page.text, "lxml")
        fkey = sauce.select_one("[name='fkey']")["value"]
        payload = {"fkey": fkey, "email": email, "password": password}
        res = requests.post(cls.SEARCH_URL, data=payload, headers={"User-Agent": "Mozilla/5.0"})
        soup = BeautifulSoup(res.text, "lxml")
        user = soup.select_one("div[class^='gravatar-wrapper-']").get("title")
        yield user

if __name__ == '__main__':
    with requests.Session() as s:
        result = StackOverflow(s)
        for item in result.crawl("email", "password"):
            print(item)
How can I use the session defined in __init__ within the classmethod?

You can't access self.session from a class method. The __init__ method is called when an instance of the class is created, but class methods are not bound to any particular instance of the class, only to the class itself - that's why their first parameter is conventionally cls and not self.
You decided to create the session in __init__, so it can be assumed that
so1 = StackOverflow()
so2 = StackOverflow()
keep their sessions separate. If that is indeed your intention, the crawl method should not be decorated with @classmethod. If you define crawl(self, email, password) instead, you can still use StackOverflow.SEARCH_URL or self.__class__.SEARCH_URL to get the value defined on the StackOverflow class, or self.SEARCH_URL, which by default resolves to the same value but can be overridden per instance with so1.SEARCH_URL = "sth else" (while so2.SEARCH_URL would keep its original value).
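For illustration, a minimal sketch of crawl rewritten as an instance method that reuses self.session for both requests (same selectors and flow as in the question, untested against the live site):

import requests
from bs4 import BeautifulSoup

class StackOverflow:
    SEARCH_URL = "https://stackoverflow.com/users/login?ssrc=head&returnurl=https%3a%2f%2fstackoverflow.com%2f"

    def __init__(self, session):
        self.session = session

    def crawl(self, email, password):
        # use the session stored on the instance instead of bare requests calls
        page = self.session.get(self.SEARCH_URL, headers={"User-Agent": "Mozilla/5.0"})
        fkey = BeautifulSoup(page.text, "lxml").select_one("[name='fkey']")["value"]
        payload = {"fkey": fkey, "email": email, "password": password}
        res = self.session.post(self.SEARCH_URL, data=payload, headers={"User-Agent": "Mozilla/5.0"})
        soup = BeautifulSoup(res.text, "lxml")
        user = soup.select_one("div[class^='gravatar-wrapper-']").get("title")
        yield user

if __name__ == '__main__':
    with requests.Session() as s:
        for item in StackOverflow(s).crawl("email", "password"):
            print(item)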

Related

How to split code into different python files

I have been working on an I/O-bound application where I run multiple scripts at the same time depending on the arguments I pass to each script, e.g. monitor.py --s="sydsvenskan", monitor.py -ss="bbc" and so on.
from __future__ import annotations
from abc import abstractmethod
from typing import ClassVar, Dict
from typing import Optional
import attr
import requests
from selectolax.parser import HTMLParser


@attr.dataclass
class Info:
    """Scraped info about news"""
    all_articles: set = attr.ib(factory=set)
    store: str = attr.ib(factory=str)
    name: Optional[str] = attr.ib(factory=str)
    image: Optional[str] = attr.ib(factory=str)


class Scraper:
    scrapers: ClassVar[Dict[str, Scraper]] = {}
    domain: ClassVar[str]

    def __init_subclass__(cls) -> None:
        Scraper.scrapers[cls.domain] = cls

    @classmethod
    def for_url(cls, domain, url) -> Scraper:
        return cls.scrapers[domain](url)

    @abstractmethod
    def scrape_feed(self):
        pass

    @abstractmethod
    def scrape_product(self):
        pass


class BBCScraper(Scraper):
    domain = 'BBC'

    def __init__(self, url):
        self.url = url

    def scrape_feed(self):
        with requests.get(self.url) as rep:
            # FIXME Better way than this atleast :P
            if rep:
                doc = HTMLParser(rep.text)
                all_articles = {
                    f"https://www.BBC.se{product_link.attrs['href']}" for product_link in
                    doc.css('td.search-productnamne > a, div.product-image > a')
                }
                return Info(
                    store="BBC",
                    all_articles=all_articles
                )

    def scrape_product(self):
        with requests.get(self.url) as rep:
            # FIXME Better way than this atleast :P
            if rep:
                doc = HTMLParser(rep.text)
                # FIXME Scrape valid webelements
                name = "Test"
                image = "Test"
                return Info(
                    store="BBC",
                    name=name,
                    image=image,
                )


class SydsvenskanScraper(Scraper):
    domain = 'Sydsvenskan'

    def __init__(self, url):
        self.url = url

    def scrape_feed(self):
        with requests.get(self.url) as rep:
            # FIXME Better way than this atleast :P
            if rep:
                doc = HTMLParser(rep.text)
                all_articles = {
                    f"https://Sydsvenskan.se/{product_link.attrs['href']}" for product_link in
                    doc.css('div.product-image > a, td.search-productnamne > a')
                }
                return Info(
                    store="Sydsvenskan",
                    all_articles=all_articles
                )

    def scrape_product(self):
        with requests.get(self.url) as rep:
            # FIXME Better way than this atleast :P
            if rep:
                doc = HTMLParser(rep.text)
                # FIXME Scrape valid webelements
                name = "Test"
                image = "Test"
                return Info(
                    store="Sydsvenskan",
                    name=name,
                    image=image,
                )


if __name__ == "__main__":
    # FIXME Use arguments instead
    domain = 'BBC'
    url = 'https://www.bbc.co.uk/'
    scraper = Scraper.for_url(domain, url)
    r = scraper.scrape_feed()
    print(r)
As you can see, I have currently "hardcoded":
domain = 'BBC'
url = 'https://www.bbc.co.uk/'
which will be passed in through arguments instead.
However, as you can see, if I start to add more "stores/news sites" to the Scraper class, e.g. 40 different sites, it would become pretty hard to navigate to the correct code when you want to maintain it or make any changes.
I wonder how I can split the code into different files in that case, so that e.g. Sydsvenskan lives in its own file and BBC in its own; I could then maintain the code more easily whenever there are future changes.
Ok, I understand what you're looking for. And sorry to say, you're out of luck, at least as far as my knowledge of Python goes. You can do it in two ways.
1. Use importlib to search through a folder/package that contains those files and import them into a list or dict to be retrieved. You said you wanted to avoid this, but either way you would have to use importlib, and option 2 is the reason why.
2. Use a base class that, when subclassed, adds the derived class to a list or dict that stores it, so it can be retrieved later via the class object (your __init_subclass__ hook already does this). The issue here is that if you move a derived class into a new file, that registration code won't run until you import the file. So you would still need to explicitly import the file, or implicitly import it via importlib (a dynamic import).
So you'll have to use importlib (a dynamic import) either way.
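For illustration, a rough sketch of the dynamic-import side, assuming the per-site scrapers are moved into a package named scrapers (the package name and file layout are assumptions, not part of the question):

import importlib
import pkgutil

import scrapers  # hypothetical package holding bbc.py, sydsvenskan.py, ..., each defining a Scraper subclass


def load_scrapers() -> None:
    # Importing each module runs its class definitions, which triggers
    # Scraper.__init_subclass__ and fills Scraper.scrapers with every subclass.
    for module_info in pkgutil.iter_modules(scrapers.__path__):
        importlib.import_module(f"scrapers.{module_info.name}")

Call load_scrapers() once at startup, before Scraper.for_url(domain, url), and the registry behaves the same as it does today with everything in one file.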

lxml etree cleanup_namespaces returns None instead of cleaned tree

I wrote a small class for scraping a webpage that holds some documents inside folders, all of which are hosted on S3. I converted the response into an XML tree, and I need to strip the namespace URL prefix from each element.
Here's the code and issues:
import requests
from lxml import etree

class scraper():
    def __init__(self, BASE_URL, headers):
        self.BASE_URL = BASE_URL
        self.headers = headers
        self.URL = self.BASE_URL + '?delimiter=/'

    def clean_root(self, root):
        "Needed to clean the URL prefix in front of each XML element"
        for elem in root.getiterator():
            elem.tag = etree.QName(elem).localname
        return etree.cleanup_namespaces(root)

    def get_root_folder_names(self):
        "Retrieve the folders"
        res = requests.get(self.URL, headers=self.headers)
        root = etree.XML(res.content)
        print(f"{root}")  # prints: "root: <Element {http://s3.amazonaws.com/doc/2016-11-11/}ListBucketResult at 0x8f87b456e441>"
        print(f"{self.clean_root(root)}")  # prints: "None", where it should print "<Element ListBucketResult at 0x8f87b456e441>"

Call it like this:
myInstance = scraper(BASE_URL, headers)
myInstance.get_root_folder_names()
If I call clean_root(root) from the get_root_folder_names function, the result is None, as if it was never applied. But root does exist just before the call to this function, as it gets printed correctly. I got the idea from here: https://www.kite.com/python/answers/how-to-call-an-instance-method-in-the-same-class-in-python
What am I doing wrong?
I also tried defining the clean_root function without self, but then, when I call it from the get_root_folder_names function, I get NameError: name 'clean_tree' is not defined.
The problem isn't really about calling functions from other functions. It's about confusing pure functions with functions that have side effects.
The function cleanup_namespaces returns None. It modifies the tree in place rather than creating a new one (this is like the problem beginners often have with list.sort).
Change the end of the clean_root function to this:
    etree.cleanup_namespaces(root)
    return root
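For context, the whole clean_root method after that change would look roughly like this (same logic as in the question, only the return value is fixed):

    def clean_root(self, root):
        "Needed to clean the URL prefix in front of each XML element"
        for elem in root.getiterator():
            elem.tag = etree.QName(elem).localname
        etree.cleanup_namespaces(root)  # mutates the tree in place and returns None
        return root  # return the mutated tree instead of the None from cleanup_namespaces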

How to fetch a new URL and update response object outside scrapy shell?

Using the scrapy shell I can use the fetch method to grab the content of a new URL. This method automatically updates the response object, as the following code demonstrates (see the comments):
items = response.css('article')
for i in items:
    title = i.css('a[class="item-link "]::text').get()
    price = i.css('span[class="item-price h2-simulated"]::text').get()
    rooms = i.css('span[class="item-detail"]::text').get()
    meters = i.css('span[class="item-detail"]:nth-of-type(2)::text').get()
    url = response.urljoin(i.css('a[class="item-link "]').attrib['href'])
    fetch(url, redirect=False)
    # After fetch() is called the response object is updated
    # and I can extract new information
    loc = response.xpath('/html/body/script[9]').get()
    latitude = re.findall('latitude:"([-.0-9]+)"', loc)[0]
    longitude = re.findall('longitude:"([-.0-9]+)"', loc)[0]
But I don't know how to do the same outside the scrapy shell, using a Spider class and its methods.
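Outside the shell there is no fetch() that mutates response in place; the usual Scrapy pattern is to yield a new Request with a callback and read the detail page from the response passed to that callback. A rough sketch under that assumption (the listing URL is a placeholder and the selectors are copied from the shell session above, so they remain site-specific assumptions):

import re
import scrapy

class ItemsSpider(scrapy.Spider):
    name = "items"
    start_urls = ["https://example.com/listing"]  # placeholder listing URL

    def parse(self, response):
        for i in response.css('article'):
            item = {
                "title": i.css('a[class="item-link "]::text').get(),
                "price": i.css('span[class="item-price h2-simulated"]::text').get(),
            }
            url = response.urljoin(i.css('a[class="item-link "]').attrib['href'])
            # instead of fetch(): request the detail page and continue in the callback
            yield scrapy.Request(url, callback=self.parse_detail, cb_kwargs={"item": item})

    def parse_detail(self, response, item):
        # the detail page arrives here as a fresh response object
        loc = response.xpath('/html/body/script[9]').get()
        item["latitude"] = re.findall('latitude:"([-.0-9]+)"', loc)[0]
        item["longitude"] = re.findall('longitude:"([-.0-9]+)"', loc)[0]
        yield item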

Unable to make a bridge between two classes

I've written some code in Python and my intention is to supply the links produced by the "web_parser" class to the "get_docs" class. However, I can't come up with a productive way to do so. All I want to do is build a bridge between the two classes so that the "web_parser" class produces links and the "get_docs" class processes them to get the refined output. Any idea on how I can do this cleanly would be highly appreciated. Thanks in advance.
from lxml import html
import requests

class web_parser:
    page_link = "https://www.yellowpages.com/search?search_terms=pizza&geo_location_terms=San%20Francisco%2C%20CA"
    main_url = "https://www.yellowpages.com"

    def __init__(self, link):
        self.link = link
        self.vault = []

    def parser(self):
        self.get_link(self.page_link)

    def get_link(self, url):
        page = requests.get(url)
        tree = html.fromstring(page.text)
        item_links = tree.xpath('//h2[@class="n"]/a[@class="business-name"][not(@itemprop="name")]/@href')
        for item_link in item_links:
            self.vault.append(self.main_url + item_link)

class get_docs(web_parser):
    def __init__(self, new_links):
        web_parser.__init__(self, link)
        self.new_links = [new_links]

    def procuring_links(self):
        for link in self.vault:
            self.using_links(link)

    def using_links(self, newly_created_link):
        page = requests.get(newly_created_link)
        tree = html.fromstring(page.text)
        name = tree.findtext('.//div[@class="sales-info"]/h1')
        phone = tree.findtext('.//p[@class="phone"]')
        print(name, phone)

if __name__ == '__main__':
    crawl = web_parser(web_parser.page_link)
    parse = get_docs(crawl)
    parse.parser()
    parse.procuring_links()
I know very little about creating classes, so please forgive my ignorance. Upon execution at this stage I get an error:
web_parser.__init__(self, link)
NameError: name 'link' is not defined
I'm not very sure how you want to use it: by giving a parameter to web_parser, or by using a hardcoded link inside the class?
From the commands you are using in __main__, you could proceed like below:
class get_docs(object):
    def __init__(self, web_parser):
        self.vault = web_parser.vault

if __name__ == '__main__':
    crawl = web_parser()          # create an instance
    crawl.parser()
    parse = get_docs(crawl)       # give the instance to get_docs, or directly the vault with crawl.vault
    parse.procuring_links()       # execute the get_docs processing
You'll also need to correct the web_parser class:
you have to choose between a parameter given at creation time (link) and the hardcoded page_link; just adapt the parser() method to target the one you keep.
class web_parser:
    def __init__(self, link=''):
        self.link = link
        self.vault = []
        self.page_link = "https://www.yellowpages.com/search?search_terms=pizza&geo_location_terms=San%20Francisco%2C%20CA"
        self.main_url = "https://www.yellowpages.com"
To fix the NameError you posted in your question, you need to add another parameter to __init__ of your subclass - and pass something to it.
class get_docs(web_parser):
    # def __init__(self, new_links):
    def __init__(self, link, new_links):
        web_parser.__init__(self, link)
        self.new_links = [new_links]
That said, web_parser doesn't seem to do anything with that data, so maybe just remove it from the base class instead.
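Putting the composition option together with the web_parser class from the question (its parser() and selectors unchanged, so those remain untested assumptions about the target pages), a minimal end-to-end sketch might look like this:

class get_docs(object):
    def __init__(self, parser):
        # keep only what we need from the finished parser: its collected links
        self.vault = parser.vault

    def procuring_links(self):
        for link in self.vault:
            self.using_links(link)

    def using_links(self, newly_created_link):
        page = requests.get(newly_created_link)
        tree = html.fromstring(page.text)
        name = tree.findtext('.//div[@class="sales-info"]/h1')
        phone = tree.findtext('.//p[@class="phone"]')
        print(name, phone)

if __name__ == '__main__':
    crawl = web_parser(web_parser.page_link)
    crawl.parser()            # fill crawl.vault with the business links
    parse = get_docs(crawl)   # hand the populated parser to get_docs
    parse.procuring_links()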

Python OOP - web session

I have the following script:
import mechanize, cookielib, re ...
br = mechanize.Browser()
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)
br.addheaders = ....
and do stuff
Because my script is growing very big, I want to split it into classes. One class to handle the web connection, one class to do stuff, and so on.
From what I read, I need something like:
from web_session import * # this is my class handling the web connection (cookies + auth)
from do_stuff import * # i do stuff on web pages
and in my main, I have:
browser = Web_session()
stuff = Do_stuff()
The problem for me is that I lose session cookies when I pass them to Do_stuff. Can anyone help me with a basic example of classes and interaction? Let's say: I log in on a site, browse a page, and I want to do something like re.findall("something", on_that_page). Thanks in advance.
Update:
Main Script:
br = WebBrowser()
br.login(myId, myPass)
WebBrowser class:
class WebBrowser():
    def __init__(self):
        self.browser = mechanize.Browser()
        cj = cookielib.LWPCookieJar()
        self.browser.set_cookiejar(cj)
        self.browser.addheaders = ....

    def login(self, username, password):
        self.username = username
        self.password = password
        self.browser.open(some site)
        self.browser.submit(username, password)

    def open(self, url):
        self.url = url
        self.browser.open(url)

    def read(self, url):
        self.url = url
        page = self.browser.open(url).read()
        return page
Current state:
This part works perfectly, I can log in, but I lose the mechanize class "goodies" like opening, posting or reading a URL.
For example:
management = br.read("some_url.php")
all my cookies are gone (error: must be logged in)
How can I fix it?
The "mechanise.Browser" class has all the functionality it seens you want to put on your "Web_session" class (side note - naming conventions and readility would recomend "WebSession" instead).
Anyway, you will retain you cookies if you keep the same Browser object across calls - if you really go for having another wrapper class, just create a mehcanize.Broser when instantiating your Web_session class, and keep that as an object attribute (for example, as "self.browser") .
But, you most likelly don't need to do that - just create a Browser on the __init__ of your Do_stuff, keep it as an instance attribute, and reuse it for all requests -
class DoStuff(object):
    def __init__(self):
        self.browser = mechanize.Browser()
        cj = cookielib.LWPCookieJar()
        self.browser.set_cookiejar(cj)

    def login(self, login_url, credentials):
        # mechanize has no .post(); open() with a data argument performs a POST,
        # and the cookie jar keeps the resulting session cookies
        self.browser.open(login_url, data=credentials)  # credentials: urlencoded form string

    def match_text_at_page(self, url, text):
        # this will use the same cookies as were set during the login
        page = self.browser.open(url).read()
        return re.findall(text, page)
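A quick usage sketch under the same assumptions (the URLs and the urlencoded credential string are placeholders, not real endpoints):

stuff = DoStuff()
stuff.login("https://example.com/login", "username=me&password=secret")
print(stuff.match_text_at_page("https://example.com/members", "something"))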
Never use the construct from X import * as in
from web_session import *
from do_stuff import *
It's ok when you are experimenting in an interactive session, but don't use it in your code.
Imagine the following: in web_session.py you have a function called my_function, which you use in your main module. In do_stuff.py you have an import statement from some_lib_I_found_on_the_net import *. Everything is nice, but after a while your program mysteriously fails. It turns out that you upgraded some_lib_I_found_on_the_net.py, and the new version contained a function called my_function. Your main program is suddenly calling some_lib_I_found_on_the_net.my_function instead of web_session.my_function. Python has such nice support for separating concerns, but with that lazy construct you'll just shoot yourself in the foot, and besides, it's so nice to be able to look at your code and see where every object comes from, which you can't do with the *.
If you want to avoid long names like web_session.my_function(), do either import web_session as ws and then ws.my_function(), or from web_session import my_function, ...
Even if you only import one single module in this way, it can bite you. I had colleagues who had something like...
...
import util
...
from matplotlib import *
...
(a few hundred lines of code)
...
x = util.some_function()
...
Suddenly, they got an AttributeError on the call to util.some_function, which had worked like a charm for years. However hard they looked at the code, they couldn't understand what was wrong. It took a long time before someone realized that matplotlib had been upgraded, and now it contained a function called (you guessed it) util!
Explicit is better than implicit!
