Preface -- I'm new to Python and programming in general. I am inheriting code and am attempting to modify existing code to make it more efficient.
Current issue is an existing "for server in serverlist" loop that appends the output to a list. It takes too long because it is processing the list one-by-one. It takes about 5seconds per machine lasting around 4minutes.
What do I need to do if I want to execute this single call on all 50 servers at once thereby getting back data for all 50 servers within a total of around 5 seconds? Or rather -- is there anything that can be done to speed this up in general to call on multiple machines at once instead of just one?
Current code example:
def dosomething(server):
stuff = []
try:
# function here
stuff = # function stuff here
except Exception:
pass
return stuff
def getdata(self):
serverlist = [servera, serverb, serverc]
data = []
for server in serverlist:
results = dosomething(server)
data.append(results)
return data
Related
import trio
work_available = trio.Event()
async def get_work():
while True:
work = check_for_work()
if not work:
await work_available.wait()
else:
return work
def add_work_to_pile(...):
...
if work_available.statistics().tasks_waiting:
global work_available
work_available.set()
work_available = trio.Event()
In this Python-like code example I get work in bursts via add_work_to_pile(). The workers which get work via get_work() are slow. So most of the time add_work_to_pile() is called there will be no one waiting on work_available.
Which is better/cleaner/simpler/more pythonic/more trionic/more intended by the trio developers?
checking if someone is looking for the Event() via statistics().tasks_waiting, like in the example code, ...or...
unconditionally set() setting the Event() and creating a new one each time? (Most of them in vain.)
Furthermore... the API does not really seem to expect regular code to check if someone is waiting via this statistics() call...
I don’t mind spending a couple more lines to make things clearer. But that goes both ways: a couple CPU cycles more are fine for simpler code...
Creating a new Event is roughly the same cost as creating the _EventStatistics object within the statistics method. You'll need to profile your own code to pick out any small difference in performance. However, although it is safe and performant, the intent of statistics across trio's classes is for debug rather than core logic. Using/discarding many Event instances would be relatively more along the intent of the devs.
A more trionic pattern would be to load each work item into a buffered memory channel in place of your add_work_to_pile() method and then iterate on that in the task that awaits get_work. I feel the amount of code is comparable to your example:
import trio
send_chan, recv_chan = trio.open_memory_channel(float('inf'))
async def task_that_uses_work_items():
# # compare
# while True:
# work = await get_work()
# handle_work(work)
async for work in recv_chan:
handle_work(work)
def add_work_to_pile():
...
for work in new_work_set:
send_chan.send_nowait(work)
# maybe your work is coming in from a thread?
def add_work_from_thread():
...
for work in new_work_set:
trio_token.run_sync_soon(send_chan.send_nowait, work)
Furthermore, it's performant because the work items are efficiently rotated through a deque internally. This code would checkpoint for every work item so you may have to jump through some hoops if you want to avoid that.
I think you might want a trio.ParkingLot. It gives more control over parking (i.e. which is like Event.wait()) and unparking (which is like Event.set() except that it doesn't stop future parkers from waiting). But it doesn't have any notion of being set at all so you would need to store that information separately. If you work is naturally Truety when set (e.g. a non-empty list) then that might be easy anyway. Example:
available_work = []
available_work_pl = trio.ParkingLot()
async def get_work():
while not available_work:
await available_work_pl.park()
result = list(available_work)
available_work.clear()
return result
def add_work_to_pile():
available_work.append(foo)
available_work_pl.unpark()
Edit: Replaced "if" with "while" in get_work(). I think if has a race condition: if there are two parked tasks and then add_work_to_pile() gets called twice, then one get_work() would get both work items but the other would still be unparked and return an empty list. Using while instead will make it loop back around until more data is added.
IMHO you don't want an event in the first place. The combination of an array and something that tells the reader there's work in the array is already available as memory channels. They have the additional advantage that you can tell them how much work to accept before the sender stalls.
send_channel, recv_channel = trio.open_memory_channel(10)
get_work = recv_channel.receive
add_work_to_pile = send_channel.send
# both are async functions or use the _nowait() versions
Here is part of my code, db.users_vi() is a list file. When the program comes to def viber_not, it starts working very slow, it send 1 message per 30 sec or even slower. How can I make it work faster, and why it's so slow?
def viber_not():
users = db.users_vi()
text = random.choice(texts)
for k in users:
try:
viber.send_messages(k[1], [TextMessage(text=text)])
except:
pass
Try to serialize data requested from DB. As most of them returns "cursor", not data. For some of them wrapping in list is enough, but look to documentation of that you use.
users = list(db.users_vi())
At the time I'm just doing a python myprogra.py & and let this program do its thing:
import urllib2
import threading
import json
url = 'https://something.com'
a = []
def refresh():
# refresh in 5 minutes
threading.Timer(300.0, refresh).start()
# open url
try:
data = urllib2.urlopen(url).read(1000)
except:
return 0
# decode json
q = data.decode('utf-8')
q = json.loads(q)
# store in a
a.append(q['ticker'])
if len(a) > 288:
a.pop()
truc = json.dumps(a)
f = open('ticker.json', 'w')
f.write(truc)
f.close()
refresh()
I have two questions:
how comes it work since I didn't write global a at the start of the function
should I use a cron for this kind of thing instead of what I'm doing? (I'm using a debian server)
There is no issue with accessing the variable a the way you do, because you never assign to it within the refresh function. It is accessed the very same way as the url variable or even the json import is accessed. If you were to assign to a (rather than calling a method such as append on it), then you would create a local variable shadowing the global a. The global keyword avoids the creation of a local variable for assignments.
It is up to you whether you use a program that sleeps or cron, but here are some things to keep in mind:
Your program keeps state across requests in the variable a. If you were to use cron and invoke your program multiple times, you would need to store this state somewhere else.
If your program crashes (e.g. invalid data is returned and json decoding fails with an exception), cron would start it again, so it would eventually recover. This may or may not be desired.
When run via cron, you lower the memory footprint of the system at the expense of more computation (Python interpreter being initialized every five minutes).
I'm making a program that (at least right now) retrives stream information from TwitchTV (streaming platform). This program is to self educate myself but when i run it, it's taking 2 minutes to print just the name of the streamer.
I'm using Python 2.7.3 64bit on Windows7 if that is important in anyway.
classes.py:
#imports:
import urllib
import re
#classes:
class Streamer:
#constructor:
def __init__(self, name, mode, link):
self.name = name
self.mode = mode
self.link = link
class Information:
#constructor:
def __init__(self, TWITCH_STREAMS, GAME, STREAMER_INFO):
self.TWITCH_STREAMS = TWITCH_STREAMS
self.GAME = GAME
self.STREAMER_INFO = STREAMER_INFO
def get_game_streamer_names(self):
"Connects to Twitch.TV API, extracts and returns all streams for a spesific game."
#start connection
self.con = urllib2.urlopen(self.TWITCH_STREAMS + self.GAME)
self.info = self.con.read()
self.con.close()
#regular expressions to get all the stream names
self.info = re.sub(r'"teams":\[\{.+?"\}\]', '', self.info) #remove all team names (they have the same name: parameter as streamer names)
self.streamers_names = re.findall('"name":"(.+?)"', self.info) #looks for the name of each streamer in the pile of info
#run in a for to reduce all "live_user_NAME" values
for name in self.streamers_names:
if name.startswith("live_user_"):
self.streamers_names.remove(name)
#end method
return self.streamers_names
def get_streamer_mode(self, name):
"Returns a streamers mode (on/off)"
#start connection
self.con = urllib2.urlopen(self.STREAMER_INFO + name)
self.info = self.con.read()
self.con.close()
#check if stream is online or offline ("stream":null indicates offline stream)
if self.info.count('"stream":null') > 0:
return "offline"
else:
return "online"
main.py:
#imports:
from classes import *
#consts:
TWITCH_STREAMS = "https://api.twitch.tv/kraken/streams/?game=" #add the game name at the end of the link (space = "+", eg: Game+Name)
STREAMER_INFO = "https://api.twitch.tv/kraken/streams/" #add streamer name at the end of the link
GAME = "League+of+Legends"
def main():
#create an information object
info = Information(TWITCH_STREAMS, GAME, STREAMER_INFO)
streamer_list = [] #create a streamer list
for name in info.get_game_streamer_names():
#run for every streamer name, create a streamer object and place it in the list
mode = info.get_streamer_mode(name)
streamer_name = Streamer(name, mode, 'http://twitch.tv/' + name)
streamer_list.append(streamer_name)
#this line is just to try and print something
print streamer_list[0].name, streamer_list[0].mode
if __name__ == '__main__':
main()
the program itself works perfectly, just really slow
any ideas?
Program efficiency typically falls under the 80/20 rule (or what some people call the 90/10 rule, or even the 95/5 rule). That is, 80% of the time the program is actually running in 20% of the code. In other words, there is a good shot that your code has a "bottleneck": a small area of the code that is running slow, while the rest runs very fast. Your goal is to identify that bottleneck (or bottlenecks), then fix it (them) to run faster.
The best way to do this is to profile your code. This means you are logging the time of when a specific action occurs with the logging module, use timeit like a commenter suggested, use some of the built-in profilers, or simply print out the current time at very points of the program. Eventually, you will find one part of the code that seems to be taking the most amount of time.
Experience will tell you that I/O (stuff like reading from a disk, or accessing resources over the internet) will take longer than in-memory calculations. My guess as to the problem is that you're using 1 HTTP connection to get a list of streamers, and then one HTTP connection to get the status of that streamer. Let's say that there are 10000 streamers: your program will need to make 10001 HTTP connections before it finishes.
There would be a few ways to fix this if this is indeed the case:
See if Twitch.TV has some alternatives in their API that allows you to retrieve a list of users WITH their streaming mode so that you don't need to call an API for each streamer.
Cache results. This won't actually help your program run faster the first time it runs, but you might be able to make it so that if it runs a second time within a minute, it can reuse results.
Limit your application to only dealing with a few streamers at a time. If there are 10000 streamers, what exactly does your application do that it really needs to look at the mode of all 10000 of them? Perhaps it's better to just grab the top 20, at which point the user can press a key to get the next 20, or close the application. Often times, programming is not just about writing code, but managing expectations of what your users want. This seems to be a pet project, so there might not be "users", meaning you have free reign to change what the app does.
Use multiple connections. Right now, your app makes one connection to the server, waits for the results to come back, parses the results, saves it, then starts on the next connection. This process might take an entire half a second. If there were 250 streamers, running this process for each of them would take a little over two minutes total. However, if you could run four of them at a time, you could potentially reduce your time to just under 30 seconds total. Check out the multiprocessing module. Keep in mind that some APIs might have limits to how many connections you can make at a certain time, so hitting them with 50 connections at a time might irk them and cause them to forbid you from accessing their API. Use caution here.
You are using the wrong tool here to parse the json data returned by your URL. You need to use json library provided by default rather than parsing the data using regex.
This will give you a boost in your program's performance
Change the regex parser
#regular expressions to get all the stream names
self.info = re.sub(r'"teams":\[\{.+?"\}\]', '', self.info) #remove all team names (they have the same name: parameter as streamer names)
self.streamers_names = re.findall('"name":"(.+?)"', self.info) #looks for the name of each streamer in the pile of info
To json parser
self.info = json.loads(self.info) #This will parse the json data as a Python Object
#Parse the name and return a generator
return (stream['name'] for stream in data[u'streams'])
I have a python generator that does work that produces a large amount of data, which uses up a lot of ram. Is there a way of detecting if the processed data has been "consumed" by the code which is using the generator, and if so, pause until it is consumed?
def multi_grab(urls,proxy=None,ref=None,xpath=False,compress=True,delay=10,pool_size=50,retries=1,http_obj=None):
if proxy is not None:
proxy = web.ProxyManager(proxy,delay=delay)
pool_size = len(pool_size.records)
work_pool = pool.Pool(pool_size)
partial_grab = partial(grab,proxy=proxy,post=None,ref=ref,xpath=xpath,compress=compress,include_url=True,retries=retries,http_obj=http_obj)
for result in work_pool.imap_unordered(partial_grab,urls):
if result:
yield result
run from:
if __name__ == '__main__':
links = set(link for link in grab('http://www.reddit.com',xpath=True).xpath('//a/#href') if link.startswith('http') and 'reddit' not in link)
print '%s links' % len(links)
counter = 1
for url, data in multi_grab(links,pool_size=10):
print 'got', url, counter, len(data)
counter += 1
A generator simply yields values. There's no way for the generator to know what's being done with them.
But the generator also pauses constantly, as the caller does whatever it does. It doesn't execute again until the caller invokes it to get the next value. It doesn't run on a separate thread or anything. It sounds like you have a misconception about how generators work. Can you show some code?
The point of a generator in Python is to get rid of extra, unneeded objects after each iteration. The only time it will keep those extra objects (and thus extra ram) is when the objects are being referenced somewhere else (such as adding them to a list). Make sure you aren't saving these variables unnecessarily.
If you're dealing with multithreading/processing, then you probably want to implement a Queue that you could pull data from, keeping track of the number of tasks you're processing.
I think you may be looking for the yield function. Explained in another StackOverflow question: What does the "yield" keyword do in Python?
A solution could be to use a Queue to which the generator would add data, while another part of the code would get data from it and process it. This way you could ensure that there is no more than n items in memory at the same time.