After some painful attempts I wrote something like this:
import time
import tornado.gen
import tornado.httpclient
import tornado.ioloop

urls = [
    'http://localhost',
    'http://www.baidu.com',
    'http://www.taobao.com',
    'http://www.163.com',
    'http://www.sina.com',
    'http://www.qq.com',
    'http://www.jd.com',
    'http://www.amazon.cn',
]

@tornado.gen.coroutine
def fetch_with_coroutine(url):
    response = yield tornado.httpclient.AsyncHTTPClient().fetch(url)
    print url, len(response.body)
    raise tornado.gen.Return(response.body)

@tornado.gen.coroutine
def main():
    for url in urls:
        yield fetch_with_coroutine(url)

timestart = time.time()
tornado.ioloop.IOLoop.current().run_sync(main)
print 'async:', time.time() - timestart
But it's even a little slower than the synchronous code. In addition, the order of the output is always the same, so I think it isn't running asynchronously.
What's wrong with my code?
In main(), you're calling fetch_with_coroutine one at a time; the way you're using yield means that the second fetch can't start until the first is finished. Instead, you need to start them all first and wait for them with a single yield:
@gen.coroutine
def main():
    # 'fetches' is a list of Future objects.
    fetches = [fetch_with_coroutine(url) for url in urls]
    # 'responses' is a list of those Futures' results
    # (here, the response bodies returned by fetch_with_coroutine).
    responses = yield fetches
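For completeness, a hedged sketch of how that drops into the question's timing code (reusing urls and fetch_with_coroutine from the question, with the same Python 2 style prints):

@tornado.gen.coroutine
def main():
    fetches = [fetch_with_coroutine(url) for url in urls]  # start every fetch first
    responses = yield fetches                              # then wait for all of them at once
    raise tornado.gen.Return(responses)

timestart = time.time()
bodies = tornado.ioloop.IOLoop.current().run_sync(main)
print 'async:', time.time() - timestart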
Related
I'd like to use the method asession.run(), which expects references to async functions. The number of functions differs, so I tried to make a list of functions and spread it into the .run() arguments. But when I append them to the list, they are invoked immediately. How can I generate a variable number of functions?
from requests_html import AsyncHTMLSession

asession = AsyncHTMLSession()

urls = [
    'https://python.org/',
    'https://reddit.com/',
    'https://google.com/'
]

async def get_html(url):
    r = await asession.get(url)
    return r

functionList = []
for url in urls:
    functionList.append(get_html(url))

# results = asession.run(fn1, fn2, fn3, fn4, ...)
results = asession.run(*functionList)

for result in results:
    print(result.html.url)
You have to pass the async functions, not the coroutines. The run function calls its arguments to create the coroutines and then creates the tasks.
So when you pass coroutines, run invokes them instead of creating tasks.
The functions are called without arguments, so you should use functools.partial to bind the argument:
from functools import partial

...

for url in urls:
    functionList.append(partial(get_html, url))
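Putting it together, a minimal sketch of the corrected script, using the same names as the question:

from functools import partial
from requests_html import AsyncHTMLSession

asession = AsyncHTMLSession()

urls = [
    'https://python.org/',
    'https://reddit.com/',
    'https://google.com/'
]

async def get_html(url):
    r = await asession.get(url)
    return r

# pass callables, not coroutine objects: run() calls them itself
functionList = [partial(get_html, url) for url in urls]

results = asession.run(*functionList)
for result in results:
    print(result.html.url)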
As is:
I built a function that takes a URL as an argument, scrapes the page, and puts the parsed info into a list. Next to this, I have a list of the URLs, and I map the parser function over that list, iterating through each URL. The issue is that I have around 7000-8000 links, so parsing them iteratively takes a lot of time. This is the current iterative solution:
import itertools as it

mapped_parse_links = map(parse, my_new_list)
all_parsed = list(it.chain.from_iterable(mapped_parse_links))
'parse' is the scraper function and 'my_new_list' is the list of URLs.
To be:
I want to implement multiprocessing so that instead of iterating through the list of URLs, it would utilize multiple CPUs to pick up more links at the same time and parse the info using the parse function. I tried the following:
import multiprocessing

with multiprocessing.Pool() as p:
    mapped_parse_links = p.map(parse, my_new_list)
    all_parsed = list(it.chain.from_iterable(mapped_parse_links))
I tried different solutions using the Pool function as well; however, all of them run forever. Can someone give me pointers on how to solve this?
Thanks.
Taken, with minor alterations, from the docs for concurrent.futures:
import concurrent.futures
import urllib.request

URLS = ['http://www.foxnews.com/',
        'http://www.cnn.com/',
        'http://europe.wsj.com/',
        'http://www.bbc.co.uk/',
        'http://some-made-up-domain.com/']

# Retrieve a single page and report the URL and contents
def load_url(url, timeout):
    with urllib.request.urlopen(url, timeout=timeout) as conn:
        return conn.read()

if __name__ == '__main__':
    # We can use a with statement to ensure threads are cleaned up promptly
    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
        # Start the load operations and mark each future with its URL
        future_to_url = {executor.submit(load_url, url, 60): url for url in URLS}
        for future in concurrent.futures.as_completed(future_to_url):
            url = future_to_url[future]
            try:
                data = future.result()
                # Do something with the scraped data here
            except Exception as exc:
                print('%r generated an exception: %s' % (url, exc))
You will have to substitute your parse function in for load_url.
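As a rough sketch of that substitution (assuming, as in the question, that parse takes a single URL and returns an iterable of parsed items, and my_new_list is the list of URLs):

import concurrent.futures
import itertools as it

# Threads suit this network-bound work; max_workers is only a guess to tune.
with concurrent.futures.ThreadPoolExecutor(max_workers=20) as executor:
    mapped_parse_links = executor.map(parse, my_new_list)
    all_parsed = list(it.chain.from_iterable(mapped_parse_links))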
I've written a script in Python to get some information from a webpage. The code itself runs flawlessly when it is taken out of asyncio. However, as my script runs synchronously, I wanted to make it go through an asynchronous process so that it accomplishes the task in the shortest possible time, providing optimum performance and obviously not in a blocking manner. As I have never worked with the asyncio library, I'm seriously confused about how to make a go of it. I've tried to fit my script within the asyncio process, but it doesn't seem right. If somebody stretches out a helping hand to complete this, I would really be grateful. Thanks in advance. Here is my erroneous code:
import requests
from lxml import html
import asyncio

link = "http://quotes.toscrape.com/"

async def quotes_scraper(base_link):
    response = requests.get(base_link)
    tree = html.fromstring(response.text)
    for titles in tree.cssselect("span.tag-item a.tag"):
        processing_docs(base_link + titles.attrib['href'])

async def processing_docs(base_link):
    response = requests.get(base_link).text
    root = html.fromstring(response)
    for soups in root.cssselect("div.quote"):
        quote = soups.cssselect("span.text")[0].text
        author = soups.cssselect("small.author")[0].text
        print(quote, author)

    next_page = root.cssselect("li.next a")[0].attrib['href'] if root.cssselect("li.next a") else ""
    if next_page:
        page_link = link + next_page
        processing_docs(page_link)

loop = asyncio.get_event_loop()
loop.run_until_complete(quotes_scraper(link))
loop.close()
Upon execution what I see in the console is:
RuntimeWarning: coroutine 'processing_docs' was never awaited
processing_docs(base_link + titles.attrib['href'])
You need to call processing_docs() with await.
Replace:
processing_docs(base_link + titles.attrib['href'])
with:
await processing_docs(base_link + titles.attrib['href'])
And replace:
processing_docs(page_link)
with:
await processing_docs(page_link)
Otherwise it tries to run an asynchronous function synchronously and gets upset!
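For reference, a minimal sketch of the corrected first coroutine with the names from the question (note that requests itself is still a blocking call, so this fixes the warning but does not make the I/O concurrent):

async def quotes_scraper(base_link):
    response = requests.get(base_link)  # still blocking inside the event loop
    tree = html.fromstring(response.text)
    for titles in tree.cssselect("span.tag-item a.tag"):
        await processing_docs(base_link + titles.attrib['href'])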
I have this simple code which fetches pages via urllib2:
browser_list = ['Chrome', 'Mozilla', 'Safari', 'Internet Explorer', 'Opera']
user_string_url = "http://www.useragentstring.com/pages/"

for eachBrowser in browser_list:
    result = urllib2.urlopen(urljoin(user_string_url, eachBrowser))
Now I can read the result via result.read(), but I was wondering if all this functionality can be done outside the for loop, because the other URLs to be fetched have to wait until the current result has been processed.
I want to process result outside the for loop. Can this be done?
One of the ways to do this may be to have result as a dictionary. What you can do is:
result = {}
for eachBrowser in browser_list:
    result[eachBrowser] = urllib2.urlopen(urljoin(user_string_url, eachBrowser))
and use result[BrowserName] outside the loop.
Hope this helps.
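For example, a small sketch of using that dictionary after the loop (Python 2, matching the question):

for browser_name, response in result.items():
    print browser_name, len(response.read())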
If you simply want to access all results outside the loop, just append all results to an array or dictionary, as in the answer above.
Or, if you are trying to speed up your task, try multithreading:
import threading

class myThread(threading.Thread):
    def __init__(self, result):
        threading.Thread.__init__(self)
        self.result = result
    def run(self):
        # process your result (as self.result) here
        pass

browser_list = ['Chrome', 'Mozilla', 'Safari', 'Internet Explorer', 'Opera']
user_string_url = "http://www.useragentstring.com/pages/"

for eachBrowser in browser_list:
    result = urllib2.urlopen(urljoin(user_string_url, eachBrowser))
    # start processing the result on another thread and continue the loop without waiting
    myThread(result).start()
It's a simple way of multithreading. It may break depending on your result processing. Consider reading the documentation and some examples before you try it.
You can use threads for this:
import threading
import urllib2
from urlparse import urljoin

def worker(url):
    res = urllib2.urlopen(url)
    data = res.read()
    res.close()

browser_list = ['Chrome', 'Mozilla', 'Safari', 'Internet Explorer', 'Opera']
user_string_url = 'http://www.useragentstring.com/'

for browser in browser_list:
    url = urljoin(user_string_url, browser)
    threading.Thread(target=worker, args=[url]).start()

# wait for everyone to complete
for thread in threading.enumerate():
    if thread == threading.current_thread():
        continue
    thread.join()
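If you also need the downloaded data back in the main thread, one possible variation (a sketch, not tested) is to have each worker store its result in a shared dictionary keyed by URL, keeping the same start/join loops as above:

results = {}

def worker(url):
    res = urllib2.urlopen(url)
    try:
        results[url] = res.read()  # distinct keys per thread, so this is safe under the GIL
    finally:
        res.close()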
Are you using Python 3? If so, you can use futures for this task:
from urllib.request import urlopen
from urllib.parse import urljoin
from concurrent.futures import ThreadPoolExecutor

browser_list = ['Chrome', 'Mozilla', 'Safari', 'Internet+Explorer', 'Opera']
user_string_url = "http://www.useragentstring.com/pages/"

def process_request(url, future):
    print("Processing:", url)
    print("Reading data")
    print(future.result().read())

with ThreadPoolExecutor(max_workers=10) as executor:
    submit = executor.submit
    for browser in browser_list:
        url = urljoin(user_string_url, browser) + '/'
        submit(process_request, url, submit(urlopen, url))
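An alternative sketch with the same imports uses executor.map, which preserves the order of browser_list; fetch here is a hypothetical helper, not part of the answer above:

def fetch(url):
    with urlopen(url) as conn:
        return conn.read()

with ThreadPoolExecutor(max_workers=10) as executor:
    page_urls = [urljoin(user_string_url, browser) + '/' for browser in browser_list]
    # map runs the fetches concurrently but yields results in input order
    for page_url, data in zip(page_urls, executor.map(fetch, page_urls)):
        print("Processed:", page_url, len(data))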
You could also do this with yield:
def collect_browsers():
    browser_list = ['Chrome', 'Mozilla', 'Safari', 'Internet Explorer', 'Opera']
    user_string_url = "http://www.useragentstring.com/pages/"
    for eachBrowser in browser_list:
        yield eachBrowser, urllib2.urlopen(urljoin(user_string_url, eachBrowser))

def process_browsers():
    for browser, result in collect_browsers():
        do_something(result)
This is still a synchronous call (browser 2 will not fire until browser 1 is processed), but it lets you keep the logic for dealing with the results separate from the logic managing the connections. You could, of course, also use threads to handle the processing asynchronously, with or without yield, as in the sketch after the edit below.
Edit
Just re-read the OP; I should repeat that yield doesn't provide multi-threaded, asynchronous execution, in case that was not clear in my answer above!
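For instance, a hedged sketch of combining the generator above with threads so that processing of each result overlaps (do_something is the same placeholder as in the answer; the fetches themselves still happen one at a time inside the generator):

import threading

def process_browsers_threaded():
    threads = []
    for browser, result in collect_browsers():
        t = threading.Thread(target=do_something, args=(result,))
        t.start()
        threads.append(t)
    for t in threads:
        t.join()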
I have a stream of links coming in, and I want to check them for RSS every now and then. But when I fire off my get_rss() function, it blocks and the stream halts. This is unnecessary, and I'd like to just fire-and-forget the get_rss() function (it stores its results elsewhere).
My code is like this:
self.ff.get_rss(url)  # not async
print 'im back!'
(...)

def get_rss(url):
    page = urllib2.urlopen(url)  # not async
    soup = BeautifulSoup(page)
I'm thinking that if I can fire-and-forget the first call, then I can even use urllib2 without worrying about it not being async. Any help is much appreciated!
Edit:
Trying out gevent, but like this nothing happens:
print 'go'
g = Greenlet.spawn(self.ff.do_url, url)
print g
print 'back'
# output:
go
<Greenlet at 0x7f760c0750f0: <bound method FeedFinder.do_url of <rss.FeedFinder object at 0x2415450>>(u'http://nyti.ms/SuVBCl')>
back
The Greenlet seems to be registered, but the function self.ff.do_url(url) doesn't seem to run at all. What am I doing wrong?
Fire and forget using the multiprocessing module:
from multiprocessing import Process

def fire_and_forget(arg_one):
    # do stuff
    ...

def main_function(arg_one):
    p = Process(target=fire_and_forget, args=(arg_one,))
    # you have to set daemon to True so you don't have to wait for the process to join
    p.daemon = True
    p.start()
    return "doing stuff in the background"
Here is sample code for thread-based method invocation. Additionally, a desired threading.stack_size can be set to boost performance.
import threading
import requests

# The stack size set by threading.stack_size is the amount of memory to allocate for the call stack in threads.
threading.stack_size(524288)

def alpha_gun(url, json, headers):
    # r = requests.post(url, data=json, headers=headers)
    r = requests.get(url)
    print(r.text)

def trigger(url, json, headers):
    threading.Thread(target=alpha_gun, args=(url, json, headers)).start()

url = "https://raw.githubusercontent.com/jyotiprakash-work/Live_Video_steaming/master/README.md"
payload = "{}"
headers = {
    'Content-Type': 'application/json'
}

for i in range(10):
    print(i)
    # fire on a condition
    if i == 5:
        trigger(url=url, json=payload, headers=headers)
        print('invoked')
You want to use the threading module or the multiprocessing module and save the result either in a database, a file, or a queue.
You also can use gevent.
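If you do revisit gevent (as in the question's edit), a minimal hedged sketch: a spawned greenlet only runs once the current greenlet yields control (for example via join or gevent.sleep), and the blocking urllib2 call needs monkey-patching to cooperate, which is likely why the spawned Greenlet never appeared to run:

import gevent
from gevent import monkey
monkey.patch_all()  # patch sockets so blocking urllib2 calls cooperate

# reusing get_rss and url from the question
g = gevent.spawn(get_rss, url)
print 'im back!'

gevent.sleep(0)  # yield control so the greenlet can start; g.join() would wait for it to finish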