How to pass a variable number of functions as arguments in Python

I'd like to use the method asession.run(), which expects references to async functions. The number of functions varies, so I tried to build a list of functions and spread it into .run()'s arguments. But when I append them to the list, they are invoked immediately. How can I generate a variable number of functions?
from requests_html import AsyncHTMLSession

asession = AsyncHTMLSession()

urls = [
    'https://python.org/',
    'https://reddit.com/',
    'https://google.com/'
]

async def get_html(url):
    r = await asession.get(url)
    return r

functionList = []
for url in urls:
    functionList.append(get_html(url))

# results = asession.run(fn1, fn2, fn3, fn4, ...)
results = asession.run(*functionList)

for result in results:
    print(result.html.url)

You have to pass the async functions, not the coroutines. The run function calls its arguments to create the coroutines and then creates the tasks.
So when you pass coroutines, run invokes them instead of creating tasks.
The functions are called without arguments, so you should use functools.partial to bind the argument:
from functools import partial
...
for url in urls:
    functionList.append(partial(get_html, url))
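Putting it together, a minimal sketch of the corrected script (same names and URLs as the snippets above):

from functools import partial
from requests_html import AsyncHTMLSession

asession = AsyncHTMLSession()
urls = ['https://python.org/', 'https://reddit.com/', 'https://google.com/']

async def get_html(url):
    r = await asession.get(url)
    return r

# Pass callables, not coroutine objects: partial binds url without calling get_html.
results = asession.run(*[partial(get_html, url) for url in urls])
for result in results:
    print(result.html.url)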

Related

Parametrizing a static webdriver fixture and a generator in Pytest?

I'm building an automation test for finding any possible dead links in a WP plugin. To this end, I have two helper functions.
The first spins up a Selenium webdriver:
@pytest.fixture(scope='session')
def setup():
    d = webdriver.Firefox()
    site = os.getenv('TestSite')
    d.get(site)
    login = d.find_element_by_id('user_login')
    d.implicitly_wait(5)
    user, pw = os.getenv('TestUser'), os.getenv('TestPw')
    d.find_element(By.ID, 'user_login').send_keys(user)
    d.find_element(By.ID, 'user_pass').send_keys(pw)
    d.find_element(By.ID, 'wp-submit').click()
    yield d, site
    d.quit()
The second reads in a JSON file, picking out each page in the file then yielding it to the third function:
def page_generator() -> Iterator[Dict[str, Any]]:
    try:
        json_path = '../linkList.json'
        doc = open(json_path)
        body = json.loads(doc.read())
        doc.close()
    except FileNotFoundError:
        print(f'Did not find a file at {json_path}')
        exit(1)
    for page in body['pages']:
        yield page
The third function is the meat and potatoes of the test, running through the page and looking for each link. My goal is to parametrize each page, with my function header presently looking like...
@pytest.mark.parametrize('spin_up, page', [(setup, p) for p in page_generator()])
def test_links(spin_up, page):
    driver, site = spin_up
    # Do all the things
Unfortunately, running this results in TypeError: cannot unpack non-iterable function object. Is the only option to stick yield d, site inside some sort of loop to turn the function into an iterable, or is there a way to tell test_links to iteratively pull the same setup function as its spin_up value?
There is an easy way to do what you want, and that's to specify the setup fixture directly as an argument to test_links. Note that pytest is smart enough to figure out that the setup argument refers to a fixture while the page argument refers to a parametrization:
@pytest.mark.parametrize('page', page_generator())
def test_links(setup, page):
    driver, site = setup
You might also take a look at parametrize_from_file. This is a package I wrote to help with loading test cases from JSON/YAML/TOML files. It basically does the same thing as your page_generator() function, but more succinctly (and more generally).
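To make the separation concrete, here is a minimal, self-contained sketch of the same pattern, with hypothetical stand-ins for the webdriver fixture and the JSON-backed generator:

import pytest

@pytest.fixture(scope='session')
def setup():
    # hypothetical stand-in for the real Selenium fixture
    yield 'driver', 'https://example.com'

def page_generator():
    # hypothetical stand-in for the JSON-backed generator
    yield {'name': 'home', 'path': '/'}
    yield {'name': 'about', 'path': '/about'}

@pytest.mark.parametrize('page', page_generator())
def test_links(setup, page):
    driver, site = setup                    # fixture value, resolved by pytest
    assert page['path'].startswith('/')     # parametrized value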

Unable to let a script populate results using concurrent.futures in a customized manner

I've created a script in Python to scrape the user_name from a site's landing page and the title from its inner page. I'm trying to use the concurrent.futures library to perform parallel tasks. I know how to use executor.submit() within the script below, so I'm not interested in going that way. I would like to go for executor.map(), which I've already defined (perhaps in the wrong way) within the following script.
I've tried with:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import concurrent.futures as futures

URL = "https://stackoverflow.com/questions/tagged/web-scraping"
base = "https://stackoverflow.com"

def get_links(s,url):
    res = s.get(url)
    soup = BeautifulSoup(res.text,"lxml")
    for item in soup.select(".summary"):
        user_name = item.select_one(".user-details > a").get_text(strip=True)
        post_link = urljoin(base,item.select_one(".question-hyperlink").get("href"))
        yield s,user_name,post_link

def fetch(s,name,url):
    res = s.get(url)
    soup = BeautifulSoup(res.text,"lxml")
    title = soup.select_one("h1[itemprop='name'] > a").text
    return name,title

if __name__ == '__main__':
    with requests.Session() as s:
        with futures.ThreadPoolExecutor(max_workers=5) as executor:
            link_list = [url for url in get_links(s,URL)]
            for result in executor.map(fetch, *link_list):
                print(result)
I get the following error when I run the above script as is:
TypeError: fetch() takes 3 positional arguments but 50 were given
If I run the script modifying this portion link_list = [url for url in get_links(s,URL)][0], I get the following error:
TypeError: zip argument #1 must support iteration
How can I successfully execute the above script keeping the existing design intact?
Because fetch takes 3 arguments (s, name, url), you need to pass 3 iterables to executor.map().
When you do this:
executor.map(fetch, *link_list)
link_list unpacks into 49 or so tuples, each with 3 elements (the Session object, the username, and the url). That's not what you want.
What you need to do is first transform link_list into 3 separate iterables (one for the Session objects, another for the usernames, and one for the urls). Instead of doing this manually, you can use zip() and the unpacking operator twice, like so:
for result in executor.map(fetch, *zip(*link_list)):
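To see what *zip(*link_list) does, here is a small illustration with placeholder values (hypothetical data, not taken from the script):

link_list = [('s1', 'alice', 'url1'), ('s2', 'bob', 'url2')]
sessions, names, urls = zip(*link_list)  # transpose rows into columns
# sessions == ('s1', 's2'), names == ('alice', 'bob'), urls == ('url1', 'url2')
# so executor.map(fetch, *zip(*link_list)) is executor.map(fetch, sessions, names, urls)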
Also, when I tested your code, an exception was raised in get_links:
user_name = item.select_one(".user-details > a").get_text(strip=True)
AttributeError: 'NoneType' object has no attribute 'get_text'
item.select_one returned None, which obviously doesn't have a get_text() method, so I wrapped that part in a try/except block, caught the AttributeError, and continued the loop.
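For example, the loop body in get_links can be guarded like this (a sketch of the workaround described above):

for item in soup.select(".summary"):
    try:
        user_name = item.select_one(".user-details > a").get_text(strip=True)
        post_link = urljoin(base, item.select_one(".question-hyperlink").get("href"))
    except AttributeError:
        # select_one returned None for this item; skip it
        continue
    yield s, user_name, post_link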
Also note that Requests' Session class isn't thread-safe. Luckily, the script returned sane responses when I ran it, but if you need your script to be reliable, you need to address this. A comment in the 2nd link shows how to use one Session instance per thread thanks to thread-local data. See:
Document threading contract for Session class
Thread-safety of FutureSession
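A common way to get one Session per thread is thread-local storage; a minimal sketch of that general pattern (not code taken from the linked issues):

import threading
import requests

thread_local = threading.local()

def get_session():
    # Lazily create one Session per worker thread.
    if not hasattr(thread_local, 'session'):
        thread_local.session = requests.Session()
    return thread_local.session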

creating a generator in a threaded manner

How can I create a threaded generator that yields output as soon as the first call finishes? I have client.decode, which does some nltk work and then gives back a string; for each link in links it will give a different string (regardless of the link).
import scrapy
from scrapy.http import JsonRequest
import nltk, json, os

class Hd3Spider(scrapy.Spider):
    name = 'hd3'

    def start_requests(self):
        url = get_url('http://httpbin.org/anything')
        for di in links:
            decoded_str = client.decode(type=4,)  # function that takes about 40-100 sec
            data = {'decoded_str': decoded_str}
            yield JsonRequest(url, data=data, callback=self.parse, meta={'original': di['enc']})

    def parse(self, response):
        # rest of code
What I want is to speed things up: instead of solving one at a time in the for loop, I would like to get the output of each call as soon as it is produced.
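One common way to get results as soon as each call finishes is concurrent.futures.as_completed; a sketch independent of the Scrapy spider above, assuming client.decode can safely run in worker threads:

from concurrent.futures import ThreadPoolExecutor, as_completed

def decode_all(links, client, max_workers=8):
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {executor.submit(client.decode, type=4): di for di in links}
        for future in as_completed(futures):
            di = futures[future]
            yield di, future.result()  # yielded as soon as each decode finishes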

Problems with tornado coroutine. Doesn't run asynchronously

After some painful attempts I wrote something like this:
import time
import tornado.gen
import tornado.httpclient
import tornado.ioloop

urls = [
    'http://localhost',
    'http://www.baidu.com',
    'http://www.taobao.com',
    'http://www.163.com',
    'http://www.sina.com',
    'http://www.qq.com',
    'http://www.jd.com',
    'http://www.amazon.cn',
]

@tornado.gen.coroutine
def fetch_with_coroutine(url):
    response = yield tornado.httpclient.AsyncHTTPClient().fetch(url)
    print url, len(response.body)
    raise tornado.gen.Return(response.body)

@tornado.gen.coroutine
def main():
    for url in urls:
        yield fetch_with_coroutine(url)

timestart = time.time()
tornado.ioloop.IOLoop.current().run_sync(main)
print 'async:', time.time() - timestart
but it's even a little slower than the synchronous code. In addition, the order of output is always the same, so I think it isn't running asynchronously.
What's wrong with my code?
In main(), you're calling fetch_with_coroutine one at a time; the way you're using yield means that the second fetch can't start until the first is finished. Instead, you need to start them all first and wait for them with a single yield:
@gen.coroutine
def main():
    # 'fetches' is a list of Future objects.
    fetches = [fetch_with_coroutine(url) for url in urls]
    # 'responses' is a list of those Futures' results
    # (i.e. HTTPResponse objects).
    responses = yield fetches
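On newer Tornado releases the same fan-out can also be written with gen.multi, which waits on a collection of futures at once (a sketch, assuming Tornado 4.x or later):

from tornado import gen

@gen.coroutine
def main():
    # Start every fetch first, then wait for all of them together.
    responses = yield gen.multi([fetch_with_coroutine(url) for url in urls])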

Fire off function without waiting for answer (Python)

I have a stream of links coming in, and I want to check them for rss every now and then. But when I fire off my get_rss() function, it blocks and the stream halts. This is unnecessary, and I'd like to just fire-and-forget about the get_rss() function (it stores its results elsewhere.)
My code looks like this:
self.ff.get_rss(url)  # not async
print 'im back!'
(...)

def get_rss(url):
    page = urllib2.urlopen(url)  # not async
    soup = BeautifulSoup(page)
I'm thinking that if I can fire-and-forget the first call, then I can even use urllib2 without worrying about it not being async. Any help is much appreciated!
Edit:
Trying out gevent, but like this nothing happens:
print 'go'
g = Greenlet.spawn(self.ff.do_url, url)
print g
print 'back'
# output:
go
<Greenlet at 0x7f760c0750f0: <bound method FeedFinder.do_url of <rss.FeedFinder object at 0x2415450>>(u'http://nyti.ms/SuVBCl')>
back
The Greenlet seems to be registered, but the function self.ff.do_url(url) doesn't seem to run at all. What am I doing wrong?
Fire and forget using the multiprocessing module:
from multiprocessing import Process

def fire_and_forget(arg_one):
    # do stuff
    ...

def main_function(arg_one):
    p = Process(target=fire_and_forget, args=(arg_one,))
    # set daemon to True so you don't have to wait for the process to join
    p.daemon = True
    p.start()
    return "doing stuff in the background"
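A usage sketch with a hypothetical argument; note that daemon processes are terminated when the parent exits, so work that must always complete should not rely on a daemonized process:

if __name__ == '__main__':
    print(main_function('http://example.com/feed'))
    # the parent keeps running; the child works in the background and is
    # killed if the parent exits before it finishes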
Here is sample code for a thread-based method invocation. Additionally, threading.stack_size can be set if desired to tune the per-thread stack and boost performance.
import threading
import requests

# The stack size set by threading.stack_size is the amount of memory to allocate
# for the call stack in threads.
threading.stack_size(524288)

def alpha_gun(url, json, headers):
    # r = requests.post(url, data=json, headers=headers)
    r = requests.get(url)
    print(r.text)

def trigger(url, json, headers):
    threading.Thread(target=alpha_gun, args=(url, json, headers)).start()

url = "https://raw.githubusercontent.com/jyotiprakash-work/Live_Video_steaming/master/README.md"
payload = "{}"
headers = {
    'Content-Type': 'application/json'
}

for i in range(10):
    print(i)
    # fire the request only when the condition is met
    if i == 5:
        trigger(url=url, json=payload, headers=headers)
        print('invoked')
You want to use the threading module or the multiprocessing module and save the result in a database, a file, or a queue.
You can also use gevent.
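On the gevent attempt from the question's edit: a spawned Greenlet only runs when the current greenlet yields control, and blocking calls such as urllib2.urlopen won't yield unless the standard library is monkey-patched. A minimal sketch of that fix (assuming gevent is installed; do_url stands in for the real method):

from gevent import monkey
monkey.patch_all()  # make socket/urllib calls cooperative

import gevent

def do_url(url):
    # ... fetch and parse as before ...
    pass

g = gevent.spawn(do_url, 'http://example.com/feed')
# either keep running (the hub schedules g whenever we block), or wait at shutdown:
# gevent.joinall([g])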
