Recursively calling function on certain condition - python

I have a function that extracts content from a random page of a website using the BeautifulSoup library, so the content differs on every call. The extraction itself works. However, if the output text is a particular value (say 'abc'), I want to call the function again and again until I get a different output. I added an if condition to do this, but it's not working as I expected:
class MyClass:
    def get_comment(self):
        source = requests.get('https://www.example.com/random').text
        soup = BeautifulSoup(source, 'lxml')
        comment = soup.find('div', class_='commentMessage').span.text
        if comment == "abc":
            logging.warning('Executing again....')
            self.get_comment() #Problem here....Not executing again
        return comment

mine = MyClass()
mine.get_comment() # I get 'abc' output

When you call your function recursively you aren't doing anything with the output:
class MyClass:
    def get_comment(self):
        source = requests.get('https://www.example.com/random').text
        soup = BeautifulSoup(source, 'lxml')
        comment = soup.find('div', class_='commentMessage').span.text
        if comment == "abc":
            logging.warning('Executing again....')
            return self.get_comment() #Call the method again, AND return result from that call
        else:
            return comment #return unchanged

mine = MyClass()
mine.get_comment()
I think this should be more like what you're after.
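If the site keeps returning 'abc' for a long time, the recursive version can eventually hit Python's recursion limit. A loop with a retry cap avoids that. The following is only a sketch: `get_comment_with_retries`, `fetch_comment` and `max_attempts` are hypothetical names, and the scraping call is replaced by a stand-in so the example runs without a network.

```python
def get_comment_with_retries(fetch_comment, unwanted="abc", max_attempts=10):
    """Call fetch_comment() until it returns something other than `unwanted`."""
    for _ in range(max_attempts):
        comment = fetch_comment()
        if comment != unwanted:
            return comment
    # Give up instead of looping forever if the unwanted value keeps coming back.
    raise RuntimeError(f"still got {unwanted!r} after {max_attempts} attempts")


# Stand-in for the real requests/BeautifulSoup call (an assumption for the demo).
_fake_responses = iter(["abc", "abc", "hello"])
result = get_comment_with_retries(lambda: next(_fake_responses))
print(result)  # hello
```

In the class from the question, `fetch_comment` would be a method that performs one request and returns the extracted text.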

Related

Scraper collecting the content of first page only

I've written a scraper in Python to collect movie names from YIFY torrents. The search results span around 12 pages. If I run my crawler with a print statement, it gives me the results from all the pages. However, when I use return instead, I get the content of the first page only, and it does not go on to process the remaining pages. I'm having a hard time understanding the behavior of the return statement, so I'd be very grateful if somebody could point out where I'm going wrong and suggest a workaround. Thanks in advance.
This is what I'm trying (the full code):
import requests
from urllib.request import urljoin
from lxml.html import fromstring

main_link = "https://www.yify-torrent.org/search/western/"

# film_storage = [] #I tried like this as well (keeping the list storage outside the function)

def get_links(link):
    root = fromstring(requests.get(link).text)
    film_storage = []
    for item in root.cssselect(".mv"):
        name = item.cssselect("h3 a")[0].text
        film_storage.append(name)
    return film_storage

    next_page = root.cssselect(".pager a:contains('Next')")[0].attrib['href'] if root.cssselect(".pager a:contains('Next')") else ""
    if next_page:
        full_link = urljoin(link,next_page)
        get_links(full_link)

if __name__ == '__main__':
    items = get_links(main_link)
    for item in items:
        print(item)
But when I do it like below, I get all the results (only the relevant portion is pasted):
def get_links(link):
    root = fromstring(requests.get(link).text)
    for item in root.cssselect(".mv"):
        name = item.cssselect("h3 a")[0].text
        print(name) ## using print i get all the results from all the pages

    next_page = root.cssselect(".pager a:contains('Next')")[0].attrib['href'] if root.cssselect(".pager a:contains('Next')") else ""
    if next_page:
        full_link = urljoin(link,next_page)
        get_links(full_link)
Your return statement terminates your get_links() function prematurely, which means that this part
next_page = root.cssselect(".pager a:contains('Next')")[0].attrib['href'] if root.cssselect(".pager a:contains('Next')") else ""
if next_page:
    full_link = urljoin(link,next_page)
    get_links(full_link)
is never executed.
A quick fix would be to put the return statement at the end of your function, but then you have to make film_storage global (defined outside the get_links() function).
Edit:
I just realized that, since you will be making film_storage global, there is no need for the return statement.
Your code in main would then just look like this:
get_links(main_link)
for item in film_storage:
    print(item)
Your film_storage results list is local to the function get_links(), which is called recursively for the next page. Each recursive call (for all the following pages) builds its own list and its result is never passed back, so the initial (entry) call returns results only for the first page.
You'll have to either (1) unwrap the tail recursion into a loop, (2) make the results list global, (3) use a callback (the way you call print), or, as the best option, (4) turn the get_links function into a generator that yields results for all pages.
Generator version:
def get_links(link):
    root = fromstring(requests.get(link).text)
    for item in root.cssselect(".mv"):
        name = item.cssselect("h3 a")[0].text
        yield name

    next_page = root.cssselect(".pager a:contains('Next')")[0].attrib['href'] if root.cssselect(".pager a:contains('Next')") else ""
    if next_page:
        full_link = urljoin(link,next_page)
        for name in get_links(full_link):
            yield name
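Option (1), unwrapping the recursion into a loop, can be sketched as follows. To keep the example runnable without network access, page fetching is injected through a hypothetical `fetch_page` callable that returns the names found on a page plus the link to the next page (or an empty string). In the real scraper that callable would wrap the requests and lxml calls from the question.

```python
def collect_names(start_link, fetch_page):
    """Follow 'next' links in a loop, accumulating names from every page."""
    film_storage = []
    link = start_link
    while link:
        names, next_link = fetch_page(link)
        film_storage.extend(names)
        link = next_link  # an empty string ends the loop
    return film_storage


# Hypothetical in-memory "site" with three pages, standing in for real HTTP requests.
FAKE_SITE = {
    "/page1": (["A", "B"], "/page2"),
    "/page2": (["C"], "/page3"),
    "/page3": (["D"], ""),
}

all_names = collect_names("/page1", FAKE_SITE.__getitem__)
print(all_names)  # ['A', 'B', 'C', 'D']
```

Because there is a single list and a single return at the end, nothing collected on later pages is lost.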

Calling a function(method) inside a class [duplicate]

I want to call the function spider, which is defined within a class, with the parameters url, word and maxPages.
When I call it the following way I get an error because spider() takes 3 arguments but is given 4.
Can someone please explain how to call a function that is defined within a class correctly?
My code looks like this:
import HTMLParser
from urllib2 import urlopen
from pandas.io.parsers import TextParser

class LinkParser(HTMLParser.HTMLParser):
    #other methods
    def spider(url,word,maxPages):
        pagesTovisit = [url]
        numberVisited=0
        foundWord = False
        maxPages = 0
        while numberVisited < maxPages and pagesTovisit != [] and not foundWord:
            numberVisited = numberVisited +1
            url = pagesTovisit[0]
            pagesTovisit = pagesTovisit[1:]
            try:
                print numberVisited, "Visiting:", url
                parser = LinkParser()
                data, links = parser.getLinks(url)
                if data.find(word)>-1:
                    foundWord = True
                pagesTovisit = pagesTovisit +links
                print "Success"
            except:
                print "failed"
        if foundWord:
            print "the word",word,"was found at",url
        else:
            print "word not found"

url = raw_input("enter the url: ")
word = raw_input("enter the word to search for: ")
maxPages = raw_input("the max pages you want to search in for are: ")
lp=LinkParser()
lp.spider(url,word,maxPages)
The indentation in your post is off, but I assume spider is meant to be inside the class. You need to add self as the first parameter of the function to make it a method:
class LinkParser(HTMLParser.HTMLParser):
    def spider(self,url,word,maxPages):
        ...
Inside your spider method you create a new LinkParser and call getLinks() on it. Instead of creating another instance of the class, you can call the method as self.getLinks(...), which reuses the current instance.
Also, methods and attributes of the instance can be reached inside a method through self:
def methodOfClass(self,additionalArguments):
    self.memberName
    self.methodName(methodArguments)
Ignoring the indentation errors, which I believe are only copy-paste issues:
Every method in Python implicitly receives the instance it is called on as its first argument, so its definition has to account for that.
Change def spider(url, word, maxPages) to def spider(self, url, word, maxPages).
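The effect of the missing self can be reproduced with a small sketch. It is written for Python 3 (the question's code is Python 2), and the classes `Broken` and `Fixed` are hypothetical stand-ins for LinkParser that only echo their arguments.

```python
class Broken:
    def spider(url, word, max_pages):  # no self: the instance lands in `url`
        return (url, word, max_pages)


class Fixed:
    def spider(self, url, word, max_pages):  # self receives the instance
        return (url, word, max_pages)


# Calling through an instance passes the instance plus three arguments,
# so the method without self receives four values for three parameters.
try:
    Broken().spider("http://example.com", "python", 5)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)
result = Fixed().spider("http://example.com", "python", 5)
print(result)
```

The first call raises a TypeError about the argument count, which is the same kind of error the question describes. The second call returns the three values unchanged.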

How to use function's return within other function (without using class)

I have a function:
def search_result(request):
    if request.method =='POST':
        data = request.body
        qd = QueryDict(data)
        place = qd.values()[2]
        indate = qd.values()[3]
        outdate = qd.values()[0]
        url = ('http://terminal2.expedia.com/x/mhotels/search?city=%s&checkInDate=%s&checkOutDate=%s&room1=2&apikey=%s') %(place, indate, outdate, MY_API_KEY)
        req = requests.get(url).text
        json_data = json.loads(req)
        results = []
        for hotels in json_data.get('hotelList'):
            results.append(hotels.get('localizedName'))
        return HttpResponse(results)
Now I want to use the return value of this function (func1) within another function to render a template, something like this:
def search_page(request):
    r = search_result(request)
    d = r.content
    return render(request,'search.html', {'res':d})
but this does not actually work.
Is there any way to do what I want (without using a class)?
I make the POST via ajax from a template form, and my first function works properly and prints the result in the console. The problem occurs when I try to use the response in the next function to render it in a template. That's why I'm asking for help. Do you have any ideas for making the response from the first function available to another function?
You have defined func1 to take the request parameter, but when you call it in your second function you do not pass it any arguments.
If you pass in request it should work.
EDIT: You are looking for the results, so I suppose you can just return them instead of the HttpResponse (maybe we need more information on what you are trying to accomplish):
def func1(request):
    ......
    results = []
    for hotels in json_data.get('hotelList'):
        results.append(hotels.get('localizedName'))
    return results

def func2(request):
    f = func1(request)
    return render(request,'search.html', {'res':f})
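The same idea, returning plain data from a helper and letting each view decide how to present it, can be sketched without Django. `extract_hotel_names` is a hypothetical helper name, and the sample response body is made up to match the shape the question's code expects (a 'hotelList' of objects with a 'localizedName').

```python
import json


def extract_hotel_names(json_data):
    """Return the localizedName of every hotel in the decoded API response."""
    return [hotel.get('localizedName') for hotel in json_data.get('hotelList', [])]


# Made-up response body standing in for the text returned by the API request.
sample_body = '{"hotelList": [{"localizedName": "Hotel A"}, {"localizedName": "Hotel B"}]}'

names = extract_hotel_names(json.loads(sample_body))
context = {'res': names}  # the dictionary the second view would hand to render()
print(context)
```

With a helper like this, one view can wrap the list in an HttpResponse while another passes it to a template, and neither has to unpack the other's response object.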

Recursion in Python not working

I am using Python to find the main category of a wiki page by repeatedly following the first entry in its category list. However, when I wrote the recursive code, it kept returning the result for the first argument I passed in, even though I try to change it inside the function.
import csv
from bs4 import BeautifulSoup
import urllib.request

string_set=[]

def get_first_category(url):
    k=urllib.request.urlopen(url)
    soup=BeautifulSoup(k)
    s=soup.find_all('a')
    for i in s:
        string_set.append(i.string)
    for i in range(-len(string_set), 0):
        if string_set[i] == ("Categories"):
            return (string_set[i + 1])

def join_with(k):
    return k.replace(" ","_")

def get_category_page(k):
    p=["https://en.wikipedia.org/wiki/Category:",k]
    return "".join(p)

def return_link(url):
    return get_category_page(join_with(get_first_category(url)))

file=open("Categories.csv")
categories=csv.reader(file)
categories=zip(*categories)

def find_category(url):
    k=get_first_category(url)
    for i in categories:
        if k in i:
            return [True,i[0]]
    return [False,k]

def main(url):
    if find_category(url)[0]:
        return find_category(url)[1]
    else:
        print(find_category(url)[1])
        return main(return_link(url))

print (main('https://en.wikipedia.org/wiki/Category:International_charities'))
The category csv is shared:
Categories.csv
Ideally, the main function should keep following the first category link until it reaches something that is in Categories.csv, but it just keeps going to the link I passed in.
def main(url):
    if find_category(url)[0]:
        return find_category(url)[1]
    else:
        print(find_category(url)[1])
        return main(return_link(url))
Here, you call the find_category function which is supposed to return the new url, then you print it, and then you call main with the original url again. You do not change the value of url based on the return value of find_category. So it keeps repeating itself.
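A separate thing that may be worth checking in the code as posted: string_set is a module-level list that get_first_category only ever appends to, and the search for "Categories" starts from the beginning of that list. On later calls the first match is therefore still the one from the first page. The sketch below illustrates the effect with made-up link texts. `first_category` and `first_category_shared` are hypothetical helpers that take the link texts directly, so nothing is fetched.

```python
def first_category(link_texts):
    """Return the link text that follows the first 'Categories' entry, or None."""
    for index, text in enumerate(link_texts[:-1]):
        if text == "Categories":
            return link_texts[index + 1]
    return None


shared = []  # plays the role of the module-level string_set


def first_category_shared(link_texts):
    """Same search, but on a list that keeps growing across calls."""
    shared.extend(link_texts)
    return first_category(shared)


page_one = ["Home", "Categories", "International charities"]
page_two = ["Home", "Categories", "Charities by scope"]

per_call = [first_category(page_one), first_category(page_two)]
accumulated = [first_category_shared(page_one), first_category_shared(page_two)]
print(per_call)
print(accumulated)
```

Building the list inside the function on every call, as `first_category` effectively does, gives a different answer for each page. The shared list returns the first page's category both times.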

Calling a Function within a secondary Function, or calling a Function defined within a larger Function

I'm trying to use Python to create a small lookup program that shows all the current prices of a theoretical portfolio, and then offers the option to either refresh the portfolio or look up a new quote of your choice.
I can get everything in the program to work; the problem I'm having is with the function definitions.
If you look at run_price1(), you'll notice that it is identical to run_price(); however, run_price() is defined inside the update function.
If I take it out of the update function, the update function doesn't work. If I don't also define it somewhere outside of the update function, the later user input doesn't work.
The question: I am looking for either a way to call a function that is defined within another function, or a way to use a previously defined function inside a second function.
My code:
import mechanize
from bs4 import BeautifulSoup

def run_price1():
    myBrowser = mechanize.Browser()
    htmlPage=myBrowser.open(web_address)
    htmlText=htmlPage.get_data()
    mySoup = BeautifulSoup(htmlText)
    myTags = mySoup.find_all("span", id=tag_id)
    myPrice = myTags[0].string
    print"The current price of, {} is: {}".format(ticker.upper(), myPrice)

def update():
    my_stocks = ["aapl","goog","sne","msft","spy","trgt","petm","fslr","fb","f","t"]
    counter = 0
    while counter < len(my_stocks):
        web_address = "http://finance.yahoo.com/q?s={}".format(my_stocks[counter])
        ticker = my_stocks[counter]
        #'yfs_l84_yhoo' - that 1(one) is really a lowercase "L"
        tag_id = "yfs_l84_{}".format(ticker.lower())
        def run_price():
            myBrowser = mechanize.Browser()
            htmlPage=myBrowser.open(web_address)
            htmlText=htmlPage.get_data()
            mySoup = BeautifulSoup(htmlText)
            myTags = mySoup.find_all("span", id=tag_id)
            myPrice = myTags[0].string
            print"The current price of, {} is: {}".format(ticker.upper(), myPrice)
        run_price()
        counter=counter+1

update()

ticker = ""
while ticker != "end":
    ticker = raw_input("Type 'update', to rerun portfolio, 'end' to stop program, or a lowercase ticker to see price: ")
    web_address = "http://finance.yahoo.com/q?s={}".format(ticker.lower())
    tag_id = "yfs_l84_{}".format(ticker.lower())
    if ticker == "end":
        print"Good Bye"
    elif ticker == "update":
        update()
    else:
        run_price1()
You can simply call run_price1() from the update() function, where you now call run_price.
Functions defined at the top of a module are global in the module, so other functions can simply refer to those by name and call them.
Any value that your function needs should be passed in as an argument (including ticker, which the print statement uses):
def run_price1(web_address, tag_id, ticker):
    # ...

def update():
    my_stocks = ["aapl","goog","sne","msft","spy","trgt","petm","fslr","fb","f","t"]
    counter = 0
    while counter < len(my_stocks):
        web_address = "http://finance.yahoo.com/q?s={}".format(my_stocks[counter])
        ticker = my_stocks[counter]
        #'yfs_l84_yhoo' - that 1(one) is really a lowercase "L"
        tag_id = "yfs_l84_{}".format(ticker.lower())
        run_price1(web_address, tag_id, ticker)
        counter=counter+1
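As a rough illustration of the same pattern without any network access, the sketch below defines the helpers once at module level and passes them everything they need as arguments. `build_lookup` and `describe` are hypothetical names, and instead of fetching a price the example only formats the address and tag id that the question's code would use.

```python
def build_lookup(ticker):
    """Return the (web_address, tag_id) pair used to look up one ticker."""
    symbol = ticker.lower()
    web_address = "http://finance.yahoo.com/q?s={}".format(symbol)
    tag_id = "yfs_l84_{}".format(symbol)
    return web_address, tag_id


def describe(ticker):
    """Module-level helper: usable from update() and from the input loop alike."""
    web_address, tag_id = build_lookup(ticker)
    return "{}: {} ({})".format(ticker.upper(), web_address, tag_id)


def update(my_stocks):
    # Iterating over the list directly replaces the manual counter.
    return [describe(ticker) for ticker in my_stocks]


lines = update(["aapl", "goog"])
for line in lines:
    print(line)
```

Since `describe` depends only on its argument, the interactive loop can call it with the user's ticker, and `update` can call it for every stock in the portfolio.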
