I have a stream of links coming in, and I want to check them for rss every now and then. But when I fire off my get_rss() function, it blocks and the stream halts. This is unnecessary, and I'd like to just fire-and-forget about the get_rss() function (it stores its results elsewhere.)
My code is like thus:
self.ff.get_rss(url) # not async
print 'im back!'
(...)
def get_rss(url):
page = urllib2.urlopen(url) # not async
soup = BeautifulSoup(page)
I'm thinking that if I can fire-and-forget the first call, then I can even use urllib2 wihtout worrying about it not being async. Any help is much appreciated!
Edit:
Trying out gevent, but like this nothing happens:
print 'go'
g = Greenlet.spawn(self.ff.do_url, url)
print g
print 'back'
# output:
go
<Greenlet at 0x7f760c0750f0: <bound method FeedFinder.do_url of <rss.FeedFinder object at 0x2415450>>(u'http://nyti.ms/SuVBCl')>
back
The Greenlet seem to be registered, but the function self.ff.do_url(url) doesn't seem to be run at all. What am I doing wrong?
Fire and forget using the multiprocessing module:
def fire_and_forget(arg_one):
# do stuff
...
def main_function():
p = Process(target=fire_and_forget, args=(arg_one,))
# you have to set daemon true to not have to wait for the process to join
p.daemon = True
p.start()
return "doing stuff in the background"
here is sample code for thread based method invocation additionally desired threading.stack_size can be added to boost the performance.
import threading
import requests
#The stack size set by threading.stack_size is the amount of memory to allocate for the call stack in threads.
threading.stack_size(524288)
def alpha_gun(url, json, headers):
#r=requests.post(url, data=json, headers=headers)
r=requests.get(url)
print(r.text)
def trigger(url, json, headers):
threading.Thread(target=alpha_gun, args=(url, json, headers)).start()
url = "https://raw.githubusercontent.com/jyotiprakash-work/Live_Video_steaming/master/README.md"
payload="{}"
headers = {
'Content-Type': 'application/json'
}
for _ in range(10):
print(i)
#for condition
if i==5:
trigger(url=url, json =payload, headers=headers)
print('invoked')
You want to use the threading module or the multiprocessing module and save the result either in database, a file or a queue.
You also can use gevent.
Related
I want to make a script where there are a few functions, the first is the add_cart which by name attempts to add an item to cart using proper cookies/headers because I can get a response which I get that ["error"] to print the log which will say retrying cart but the script suddenly stops even if I put the add_cart() function on the bottom but I also want to use the datetime module to time.sleep(2) before running the add_cart() function so confused how I could get this all up and running, I attached images below with my code it currently gets a response because it's printing in the terminal but I want to find out what I said above. Thanks!
Image (All code with headers, cookies, and payload minimized)
https://i.imgur.com/2jGwAeA.png
Please inform if something is wrong or any way I can fix my formatting? Again all headers, cookies, payload, requests url and responses is right trying to fix my other errors though
this is the full code since the bot said add it:
import json
from datetime import datetime
import time
import os
cookies = {
}
headers = {
}
atcPayload = {
}
def add_cart(cookies, headers, atcPayload):
response = requests.post('apiurl', headers=headers, cookies=cookies, data=atcPayload)
data = response.json()
print("adding")
if data["error"] == 'true':
print("retrying cart")
#(cookies/stuff hidden because it's a private project but that isn't the issue anyways)
Seems to be another issue where it won't run now in the visual studio code terminal either :(
If I understand you problem correctly, you would like to implement a loop, where you call your function, then wait 2 seconds, and do it indefinetly.
*edit
I hope I get it right now :)
Modified the code based on your comments, now it wait until "ok" returns from your function.
import time
import sys
from datetime import datetime,timedelta
Start_Time = datetime.now()
# Your function
def add_cart():
# This is for make some timing response
global Start_Time
Return_Value = "retry cart"
Current_Time = datetime.now()
# Just print out the current time
print(Current_Time)
# Get 10 second delay on status change
if Current_Time>(Start_Time+timedelta(seconds=10)):
Return_Value = "ok"
return Return_Value
# The main function
def main():
# For the response from the function
Response = None
# This will make an infinite loop
while True:
# Call your function, it will return "rety cart" until 10 seconds not pass, then it returns "ok"
Response = add_cart()
print(Response)
# Wait 2 second
time.sleep(2)
# Check the response and break out from the while loop
if ("ok" == Response):
break
# This will run if you run your file, and not run if you import it (for later use)
if __name__ == "__main__":
try:
# Run the main function defined above
main()
# If you want to interrupt the script press CTRL+c and the below part will catch it
except KeyboardInterrupt:
print("Interrupted")
sys.exit(0)
I am trying to speed up web scraping by running my http requests in a ThreadPoolExecutor from the concurrent.futures library.
Here is the code:
import concurrent.futures
import requests
from bs4 import BeautifulSoup
urls = [
'https://www.interactivebrokers.eu/en/index.php?f=41295&exch=ibfxcfd&showcategories=CFD',
'https://www.interactivebrokers.eu/en/index.php?f=41634&exch=chix_ca',
'https://www.interactivebrokers.eu/en/index.php?f=41634&exch=tase',
'https://www.interactivebrokers.eu/en/index.php?f=41295&exch=chixen-be&showcategories=STK',
'https://www.interactivebrokers.eu/en/index.php?f=41295&exch=bvme&showcategories=STK'
]
def get_url(url):
print(url)
r = requests.get(url)
soup = BeautifulSoup(r.text, 'lxml')
a = soup.select_one('a')
print(a)
with concurrent.futures.ThreadPoolExecutor(max_workers=12) as executor:
results = {executor.submit( get_url(url)) : url for url in urls}
for future in concurrent.futures.as_completed(results):
try:
pass
except Exception as exc:
print('ERROR for symbol:', results[future])
print(exc)
However when looking at how the scripts print in the CLI, it seems that the requests are sent in a blocking loop.
Additionaly if i run the code by using the below, i an see that it is taking roughly the same time.
for u in urls:
get_url(u)
I have add some success in implementing concurrency using that library before, and i am at loss regarding what is going wrong here.
I am aware of the existence of the asyncio library as an alternative, but I would be keen on using threading instead.
You're not actually running your get_url calls as tasks; you call them in the main thread, and pass the result to executor.submit, experiencing the concurrent.futures analog to this problem with raw threading.Thread usage. Change:
results = {executor.submit( get_url(url)) : url for url in urls}
to:
results = {executor.submit(get_url, url) : url for url in urls}
so you pass the function to call and its arguments to the submit call (which then runs them in threads for you) and it should parallelize your code.
I have a file with 100,000 URLs that I need to request then process. The processing takes a non-negligible amount of time compared to the request, so simply using multithreading seems to only give me a partial speed-up. From what I have read, I think using the multiprocessing module, or something similar, would offer a more substantial speed-up because I could use multiple cores. I'm guessing I want to use some multiple processes, each with multiple threads, but I'm not sure how to do that.
Here is my current code, using threading (based on What is the fastest way to send 100,000 HTTP requests in Python?):
from threading import Thread
from Queue import Queue
import requests
from bs4 import BeautifulSoup
import sys
concurrent = 100
def worker():
while True:
url = q.get()
html = get_html(url)
process_html(html)
q.task_done()
def get_html(url):
try:
html = requests.get(url, timeout=5, headers={'Connection':'close'}).text
return html
except:
print "error", url
return None
def process_html(html):
if html == None:
return
soup = BeautifulSoup(html)
text = soup.get_text()
# do some more processing
# write the text to a file
q = Queue(concurrent * 2)
for i in range(concurrent):
t = Thread(target=worker)
t.daemon = True
t.start()
try:
for url in open('text.txt'):
q.put(url.strip())
q.join()
except KeyboardInterrupt:
sys.exit(1)
If the file isn't bigger than your available memory, instead of opening it with the "open" method use mmap ( https://docs.python.org/3/library/mmap.html ). It will give the same speed as if you were working with memory and not a file.
with open("test.txt") as f:
mmap_file = mmap.mmap(f.fileno(), 0)
# code that does what you need
mmap_file.close()
After some painful attempts I wrote something like this:
urls=[
'http://localhost',
'http://www.baidu.com',
'http://www.taobao.com',
'http://www.163.com',
'http://www.sina.com',
'http://www.qq.com',
'http://www.jd.com',
'http://www.amazon.cn',
]
#tornado.gen.coroutine
def fetch_with_coroutine(url):
response=yield tornado.httpclient.AsyncHTTPClient().fetch(url)
print url,len(response.body)
raise tornado.gen.Return(response.body)
#tornado.gen.coroutine
def main():
for url in urls:
yield fetch_with_coroutine(url)
timestart=time.time()
tornado.ioloop.IOLoop.current().run_sync(main)
print 'async:',time.time()-timestart
but it's even a little slower than the synchronous code. In addition the order of output is always the same so I think it doesn't run asynchronously.
What's wrong with my code?
In main(), you're calling fetch_with_coroutine one at a time; the way you're using yield means that the second fetch can't start until the first is finished. Instead, you need to start them all first and wait for them with a single yield:
#gen.coroutine
def main():
# 'fetches' is a list of Future objects.
fetches = [fetch_with_coroutine(url) for url in urls]
# 'responses' is a list of those Futures' results
# (i.e. HTTPResponse objects).
responses = yield fetches
I have this simple code which fetches page via urllib:
browser_list= ['Chrome','Mozilla','Safari','Internet Explorer','Opera']
user_string_url="http://www.useragentstring.com/pages/"
for eachBrowser in browser_list:
result= urllib2.urlopen(urljoin(user_string_url,eachBrowser))
Now I can read the result via result.read() but I was wondering if all this functionality can be done outside the for loop. Because other URLs to be fetched will wait until all the result has been processed.
I want to process result outside the for loop. Can this be done?
One of the ways to do this maybe to have result as a dictionary. What you can do is:
result = {}
for eachBrowser in browser_list:
result[eachBrowser]= urllib2.urlopen(urljoin(user_string_url,eachBrowser))
and use result[BrowserName] outside the loop.
Hope this helps.
If you simply wants to access all results outside the loop just append all results to a array or dictionary as above answer.
Or if you trying to speed up your task try multithreading.
import threading
class myThread (threading.Thread):
def __init__(self, result):
threading.Thread.__init__(self)
self.result=result
def run(self):
// process your result(as self.result) here
browser_list= ['Chrome','Mozilla','Safari','Internet Explorer','Opera']
user_string_url="http://www.useragentstring.com/pages/"
for eachBrowser in browser_list:
result= urllib2.urlopen(urljoin(user_string_url,eachBrowser))
myThread(result).start() // it will start processing result on another thread and continue loop without any waiting
Its a simple way of multithrading. It may break depending on your result processing. Consider reading the documentation and some examples before you try.
You can use threads for this:
import threading
import urllib2
from urlparse import urljoin
def worker(url):
res = urllib2.urlopen(url)
data = res.read()
res.close()
browser_list = ['Chrome', 'Mozilla', 'Safari', 'Internet Explorer', 'Opera']
user_string_url='http://www.useragentstring.com/'
for browser in browser_list:
url = urljoin(user_string_url, browser)
threading.Thread(target=worker,args=[url]).start()
# wait for everyone to complete
for thread in threading.enumerate():
if thread == threading.current_thread(): continue
thread.join()
Are you using python3?, if so, you can use futures for this task:
from urllib.request import urlopen
from urllib.parse import urljoin
from concurrent.futures import ThreadPoolExecutor
browser_list = ['Chrome','Mozilla','Safari','Internet+Explorer','Opera']
user_string_url = "http://www.useragentstring.com/pages/"
def process_request(url, future):
print("Processing:", url)
print("Reading data")
print(future.result().read())
with ThreadPoolExecutor(max_workers=10) as executor:
submit = executor.submit
for browser in browser_list:
url = urljoin(user_string_url, browser) + '/'
submit(process_request, url, submit(urlopen, url))
You could also do this with yield:
def collect_browsers():
browser_list= ['Chrome','Mozilla','Safari','Internet Explorer','Opera']
user_string_url="http://www.useragentstring.com/pages/"
for eachBrowser in browser_list:
yield eachBrowser, urllib2.urlopen(urljoin(user_string_url,eachBrowser))
def process_browsers():
for browser, result in collect_browsers():
do_something (result)
This is still a synchronous call (browser 2 will not fire until browser 1 is processed) but you can keep the logic for dealing with the results separate from the logic managing the connections. You could of course also use threads to handle the processing asynchronously with or without yield
Edit
Just re-read OP and should repeat that yield doesn't provide multi-threaded, asynchronous execution in case that was not clear in my first answer!