Why does my algorithm get slower and slower? - python

Premise: I'm not a programmer!
I've written a Python code to perform iterations on a DB's field with Arcgis 9.2 geoprocessor.
The algorithm has to iterate over more than two thousand records, and it slows down progressively until a single iteration takes 5-6 minutes!
This is my code:
# Import system modules
import sys, string, os, arcgisscripting
# Create the Geoprocessor object
gp = arcgisscripting.create()
# Set the necessary product code
gp.SetProduct("ArcEditor")
# Load required toolboxes...
gp.AddToolbox("C:/Program Files/ArcGIS/ArcToolbox/Toolboxes/Data Management Tools.tbx")
gp.AddToolbox("C:/Program Files/ArcGIS/ArcToolbox/Toolboxes/Analysis Tools.tbx")
gp.OverWriteOutput = "True"
# Global variables...
PP_ID_2012_dbf="C:\\...\\4Paths_PP_ID.dbf"
Paths_2012 = "C:\\...\\Paths_2012.shp"
# Search Cursor
src = gp.SearchCursor(PP_ID_2012_dbf)
row = src.Next()
src.Reset()
# While cycle
while row:
    SQL_expr = row.GetValue("Query") # SQL query
    print row.GetValue("Query")
    Selected_Path = "C:\\...\\Path_"+str(row.GetValue("PointPath_"))+".shp"
    # Process: Select...
    gp.Select_analysis(Paths_2012, Selected_Path, SQL_expr)
    Paths_2012_Select_Simplify_shp = "C:\\...\\Path_"+str(row.GetValue("PointPath_"))+"_Simplify.shp"
    # Process: Simplify Lines...
    gp.SimplifyLine_management(Selected_Path, Paths_2012_Select_Simplify_shp, "POINT_REMOVE", "20 Meters", "FLAG_ERRORS", "KEEP_COLLAPSED_POINTS", "CHECK")
    del SQL_expr
    del Selected_Path
    del Paths_2012_Select_Simplify_shp
    row = src.Next()
What's wrong? I think it's a memory caching problem, but I'm not able to solve it on my own.
Please help me!

If you're performing a SQL query inside of a loop, it's plausible that you could make your code much faster by figuring out how to get all of the data in one query and then iterating through the result of that query to process it.
As someone else mentioned, you can figure out where you're spending your time by profiling your code, but DB calls are a common culprit, and if you can avoid making your DB calls inside of a loop, you may be able to reduce 2,000 db calls to one or two db calls.
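For example, one way to restructure the original loop along those lines (a rough sketch reusing the gp object and the paths from the question, not tested against ArcGIS 9.2) is to read everything you need from the DBF in a single cursor pass, release the cursor, and only then run the geoprocessing tools:
# Read all (Query, PointPath_) pairs from the DBF in one pass.
records = []
src = gp.SearchCursor(PP_ID_2012_dbf)
row = src.Next()
while row:
    records.append((row.GetValue("Query"), row.GetValue("PointPath_")))
    row = src.Next()
del row, src  # release the cursor before the heavy geoprocessing starts
# Run the geoprocessing tools without holding any cursor open.
for sql_expr, point_path in records:
    selected_path = "C:\\...\\Path_" + str(point_path) + ".shp"
    simplified_path = "C:\\...\\Path_" + str(point_path) + "_Simplify.shp"
    gp.Select_analysis(Paths_2012, selected_path, sql_expr)
    gp.SimplifyLine_management(selected_path, simplified_path, "POINT_REMOVE", "20 Meters",
                               "FLAG_ERRORS", "KEEP_COLLAPSED_POINTS", "CHECK")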

Related

(BigQuery PY Client Library v0.28) - Fetch result from table 'query' job

I'm learning BigQuery API using Python Client Libraries v0.28
https://googlecloudplatform.github.io/google-cloud-python/latest/bigquery/usage.html#run-a-simple-query
Wrote this simple code to fetch data from the table
1) Create client object
client_ = bigquery.Client.from_service_account_json('/Users/xyz/key.json')
2) Begin new Async query job
QUERY = 'SELECT visitid FROM `1234567.ga_sessions_20180101`'
query_job = client_.query(QUERY, job_id=str(uuid.uuid4()))
3) poll until the query is DONE
while query_job.state == 'RUNNING':
    time.sleep(5)
    query_job.reload()
4) Fetch the results in iteration
query_job.reload()
iter = query_job.result()
At this stage I'd like to find out how many rows are in the table. As per the docs and the GitHub code, iter is of type bigquery.table.RowIterator with a total_rows property.
5) However, at this stage when I print:
print(iter.total_rows)
It keeps returning None
I'm pretty sure this table is NOT empty and my query is correctly formatted!
Any help or pointers on what I am missing here will be really helpful... Thanks a lot!
Cheers!
You also need to check query_job.error_result to make sure the query succeeded.
You can also see your job in the UI, which can be useful for debugging, using project id and job id:
https://bigquery.cloud.google.com/results/projectid:jobid
Also, query_job.result() already waits for the job completion so you don't need to poll.
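For instance, a simplified flow without the polling loop might look like this (a sketch reusing the client_ and QUERY objects from the question):
query_job = client_.query(QUERY)   # a job_id is generated automatically if you don't pass one
rows = query_job.result()          # blocks until the job finishes and raises if the query failed
if query_job.error_result:         # also available for inspection after completion
    print(query_job.error_result)
row_count = sum(1 for _ in rows)   # counting by iterating always works, even if total_rows is None
print(row_count)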
The current behavior of how RowIterator returns None is indeed perplexing. Luckily, according to this issue, tswast's comment from 10 days ago indicates that the developers are working on a better solution.
Current awkward behavior of .total_rows
Currently, .total_rows is initialized only once iteration begins. (In what follows, for clarity I renamed your iter variable to row_iter.)
row_iter = query_job.result()
itr = iter(row_iter)
first_row = next(itr)
print(row_iter.total_rows) # Now you get a number instead of None.
This is ugly because to continue the iteration, we must either handle the first row differently or call row_iter = query_job.result() again.
Temporary workaround
A currently-working alternative is to use the value of query_job._query_results.total_rows. Unfortunately this is cheating because _query_results is private, so there is no reason to expect that this will work in the future.
Future behavior
If tswast's proposal is implemented, then row_iter.total_rows will be initialized at the beginning, just as you expect.
Suggestion
In my code, I'm going to use something like
try:
    num_rows = row_iter.total_rows or query_job._query_results.total_rows
except NameError:
    num_rows = None
to be compatible with the future behavior while falling back to the temporary workaround if necessary.

Multiprocessing to speed up for loop

Just trying to learn, and I'm wondering if multiprocessing would speed up this for loop. I'm trying to compare alexa_white_list (1,000,000 lines) and dnsMISP (which can get up to 160,000 lines).
The code checks each line in dnsMISP and looks for it in alexa_white_list. If it doesn't see it, it adds it to the blacklist. Without the mp_handler function the code works fine, but it takes around 40-45 minutes. For brevity, I've omitted all the other imports and the function that pulls down and unzips the Alexa white list.
The below gives me the following error -
File "./vetdns.py", line 128, in mp_handler
p.map(dns_check,dnsMISP,alexa_white_list)
NameError: global name 'dnsMISP' is not defined
from multiprocessing import Pool

def dns_check():
    awl = []
    blacklist = []
    ctr = 0
    dnsMISP = open(INPUT_FILE, "r")
    dns_misp_lines = dnsMISP.readlines()
    dnsMISP.close()
    alexa_white_list = open(outname, 'r')
    alexa_white_list_lines = alexa_white_list.readlines()
    alexa_white_list.close()
    print "converting awl to proper format"
    for line in alexa_white_list_lines:
        awl.append(".".join(line.split(".")[-2:]).strip())
    print "done"
    for host in dns_misp_lines:
        host = host.strip()
        host = ".".join(host.split(".")[-2:])
        if not host in awl:
            blacklist.append(host)
    file_out = open(FULL_FILENAME, "w")
    file_out.write("\n".join(blacklist))
    file_out.close()

def mp_handler():
    p = Pool(2)
    p.map(dns_check, dnsMISP, alexa_white_list)

if __name__ == '__main__':
    mp_handler()
If I label it as global etc I still get the error. I'd appreciate any
suggestions!!
There's no need for multiprocessing here. In fact this code can be greatly simplified:
def get_host_from_line(line):
    return line.strip().split(".", 1)[-1]

def dns_check():
    with open('alexa.txt') as alexa:
        awl = {get_host_from_line(line) for line in alexa}
    blacklist = []
    with open(INPUT_FILE, "r") as dns_misp_lines:
        for line in dns_misp_lines:
            host = get_host_from_line(line)
            if host not in awl:
                blacklist.append(host)
    with open(FULL_FILENAME, "w") as file_out:
        file_out.write("\n".join(blacklist))
Using a set comprehension to create your Alexa collection gives you O(1) lookup time. Sets are similar to dictionaries; they are essentially dictionaries that only have keys and no values. There is some additional memory overhead, and the initial creation will likely be slower since the values you put into a set need to be hashed and hash collisions dealt with, but the gain from the faster lookups should more than make up for it.
You can also clean up your line parsing. split() takes an additional parameter that will limit the number of times the input is split. I'm assuming your lines look something like this:
http://www.something.com and you want something.com (if this isn't the case let me know)
It's important to remember that the in operator isn't magic. When you use it to check membership (is an element in the list) what it's essentially doing under the hood is this:
for element in list:
    if element == input:
        return True
return False
So every time in your code you did if element in list your program had to iterate across each element until it either found what you were looking for or got to the end. This was probably the biggest bottleneck of your code.
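If you want to see the difference for yourself, a quick (and admittedly artificial) comparison with timeit looks something like this; the sizes are made up, not your real data:
import timeit

setup = """
items = [str(i) for i in range(100000)]
as_list = list(items)
as_set = set(items)
"""
# Worst case for the list: the element is at the end, so the whole list is scanned each time.
print(timeit.timeit("'99999' in as_list", setup=setup, number=1000))
# The set does a hash lookup instead of a scan.
print(timeit.timeit("'99999' in as_set", setup=setup, number=1000))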
You tried to read a variable named dnsMISP to pass as an argument to Pool.map. It doesn't exist in local or global scope (where do you think it's coming from?), so you got a NameError. This has nothing to do with multiprocessing; you could just type a line with nothing but:
dnsMISP
and have the same error.
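For what it's worth, if you still wanted to parallelise the membership checks, the arguments have to exist before the call and Pool.map only accepts a single iterable (the third positional argument is actually chunksize). A rough sketch of that shape, with made-up file names rather than a drop-in fix, might be the following; note that the set-based version in the other answer is already fast enough that multiprocessing is probably unnecessary:
from multiprocessing import Pool

awl = None  # filled in per worker process by init_worker

def init_worker(white_list):
    global awl
    awl = white_list

def check_host(line):
    host = ".".join(line.strip().split(".")[-2:])
    return host if host not in awl else None

if __name__ == '__main__':
    with open('alexa.txt') as f:                 # hypothetical file names
        white_list = set(".".join(l.split(".")[-2:]).strip() for l in f)
    with open('dns_misp.txt') as f:
        dns_misp_lines = f.readlines()
    pool = Pool(2, initializer=init_worker, initargs=(white_list,))
    results = pool.map(check_host, dns_misp_lines)
    pool.close()
    pool.join()
    blacklist = [h for h in results if h is not None]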

How to improve performance of a script operating on large amount of data?

My machine learning script produces a lot of data (millions of BTrees contained in one root BTree) and stores it in ZODB's FileStorage, mainly because all of it wouldn't fit in RAM. The script also frequently modifies previously added data.
When I increased the complexity of the problem, and thus the amount of data that needs to be stored, I noticed performance issues: the script now computes data on average two to ten times slower (the only thing that changed is the amount of data to be stored and later retrieved to be modified).
I tried setting cache_size to various values between 1000 and 50000. To be honest, the differences in speed were negligible.
I thought of switching to RelStorage but unfortunately in the docs they mention only how to configure frameworks such as Zope or Plone. I'm using ZODB only.
I wonder if RelStorage would be faster in my case.
Here's how I currently setup ZODB connection:
import ZODB
connection = ZODB.connection('zodb.fs', ...)
dbroot = connection.root()
It's clear to me that ZODB is currently the bottleneck of my script.
I'm looking for advice on how I could solve this problem.
I chose ZODB because I thought a NoSQL database would better fit my case, and I liked the idea of an interface similar to Python's dict.
Code and data structures:
root data structures:
if not hasattr(dbroot, 'actions_values'):
    dbroot.actions_values = BTree()
if not hasattr(dbroot, 'games_played'):
    dbroot.games_played = 0
actions_values is conceptually built as follows:
actions_values = { # BTree
    str(state): { # BTree
        # contains actions (the column to pick, to be exact, as I'm working on an agent playing Connect 4)
        # and their values (only actions previously taken by the agent are present here), e.g.:
        1: 0.4356,
        5: 0.3456,
    },
    # other states
}
state is a simple 2D array representing the game board. Possible values of its fields are 1, 2 or None:
board = [ [ None ] * cols for _ in xrange(rows) ]
(in my case rows = 6 and cols = 7)
main loop:
should_play = 10000000
transactions_freq = 10000
packing_freq = 50000
player = ReinforcementPlayer(dbroot.actions_values, config)
while dbroot.games_played < should_play:
    # max_epsilon at start and then linearly drops to min_epsilon:
    epsilon = max_epsilon - (max_epsilon - min_epsilon) * dbroot.games_played / (should_play - 1)
    dbroot.games_played += 1
    sys.stdout.write('\rPlaying game %d of %d' % (dbroot.games_played, should_play))
    sys.stdout.flush()
    board_state = player.play_game(epsilon)
    if dbroot.games_played % transactions_freq == 0:
        print('Committing...')
        transaction.commit()
    if dbroot.games_played % packing_freq == 0:
        print('Packing DB...')
        connection.db().pack()
(Packing also takes a lot of time, but it's not the main problem; I could pack the database after the program finishes.)
Code operating on dbroot (inside ReinforcementPlayer):
def get_actions_with_values(self, player_id, state):
    if player_id == 1:
        lookup_state = state
    else:
        lookup_state = state.switch_players()
    lookup_state_str = str(lookup_state)
    if lookup_state_str in self.actions_values:
        return self.actions_values[lookup_state_str]
    mirror_lookup_state_str = str(lookup_state.mirror())
    if mirror_lookup_state_str in self.actions_values:
        return self.mirror_actions(self.actions_values[mirror_lookup_state_str])
    return None

def get_value_of_action(self, player_id, state, action, default=0):
    actions = self.get_actions_with_values(player_id, state)
    if actions is None:
        return default
    return actions.get(action, default)

def set_value_of_action(self, player_id, state, action, value):
    if player_id == 1:
        lookup_state = state
    else:
        lookup_state = state.switch_players()
    lookup_state_str = str(lookup_state)
    if lookup_state_str in self.actions_values:
        self.actions_values[lookup_state_str][action] = value
        return
    mirror_lookup_state_str = str(lookup_state.mirror())
    if mirror_lookup_state_str in self.actions_values:
        self.actions_values[mirror_lookup_state_str][self.mirror_action(action)] = value
        return
    self.actions_values[lookup_state_str] = BTree()
    self.actions_values[lookup_state_str][action] = value
(Functions with mirror in the name simply reverse the columns (actions). This is done because Connect 4 boards which are vertical reflections of each other are equivalent.)
After 550000 games len(dbroot.actions_values) is 6018450.
According to iotop IO operations take 90% of the time.
Using any (other) database would probably not help, as it would be subject to the same disk IO and memory limitations as ZODB. If you manage to offload computations to the database engine itself (e.g. PostgreSQL with SQL scripts) it might help, as the database engine would have more information to make intelligent choices about how to execute the code, but there is nothing magical here and the same things can most likely be done with ZODB quite easily.
Some ideas what can be done:
Have indexes of the data instead of loading full objects (loading full objects is the equivalent of a SQL "full table scan"). Keep intelligently preprocessed copies of the data: indexes, sums, partials.
Make the objects themselves smaller (Python classes have the __slots__ trick; see the __slots__ sketch below)
Use transactions in intelligent fashion. Don't try to process all data in a single big chunk.
Parallel processing - use all CPU cores instead of single threaded approach
Don't use BTrees - maybe there is something more efficient for your use case
Having some code samples of your script, actual RAM and Data.fs sizes, etc. would help here to give further ideas.
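For the __slots__ point above, here is a minimal illustration; the class and attribute names are made up, not taken from the original script:
class ActionValue(object):
    # With __slots__ the instances have no per-object __dict__,
    # which means less RAM per object and a smaller pickle in the ZODB.
    __slots__ = ('action', 'value')

    def __init__(self, action, value):
        self.action = action
        self.value = value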
Just to be clear here, which BTree class are you actually using? An OOBTree?
A few points about those BTrees:
1) Each BTree is composed of a number of Buckets. Each Bucket will hold a certain number of items before being split. I can't remember how many items they hold currently, but I did once try tweaking the C code for them and recompiling to hold a larger number, as the value was chosen nearly two decades ago.
2) It is sometimes possible to construct very unbalanced BTrees. E.g. if you add values in sorted order (e.g. a timestamp that only ever increases) then you will end up with a tree that is O(n) to search. There was a script written by the folks at Jarn a number of years ago that could rebalance the BTrees in Zope's Catalog, which might be adaptable for you.
3) Rather than using an OOBTree you can use an OOBucket instead (see the OOBucket sketch below). This will end up being just a single pickle in the ZODB, so it may end up too big in your use case, but if you are doing all the writes in a single transaction then it may be faster (at the expense of having to rewrite the entire Bucket on every update).
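To make the OOBTree / OOBucket distinction concrete: both expose a dict-like interface, but a Bucket is stored as a single persistent record while a BTree spreads its contents over many buckets. A small sketch (the keys and values are illustrative):
from BTrees.OOBTree import OOBTree, OOBucket

tree = OOBTree()      # scales to many items; each bucket is loaded and written separately
bucket = OOBucket()   # a single pickle in the ZODB; rewritten wholesale on every change

for container in (tree, bucket):
    container['some-state-string'] = 0.4356
    print(container.get('some-state-string'))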
-Matt

Django find_each (like RoR)

Is there a way to use find_each in Django?
According to the rails documentation:
This method is only intended to use for batch processing of large
amounts of records that wouldn’t fit in memory all at once. If you
just need to loop over less than 1000 records, it’s probably better
just to use the regular find methods.
http://apidock.com/rails/ActiveRecord/Batches/ClassMethods/find_each
Thanks.
One possible solution could be to use the built-in Paginator class (could save a lot of hassle).
https://docs.djangoproject.com/en/dev/topics/pagination/
Try something like:
from django.core.paginator import Paginator
from yourapp.models import YourModel
result_query = YourModel.objects.filter(<your find conditions>)
paginator = Paginator(result_query, 1000) # the desired batch size
for page in range(1, paginator.num_pages + 1):
    for row in paginator.page(page).object_list:
        # here you can add your required code
Or, you could use the limiting options as per needs to iterate over the results.
You can query parts of the whole table with a loop and by slicing the queryset.
If you're working with DEBUG = True, it's important that you clear Django's query log after each loop, since Django stores all the queries that were run until the script finishes or dies, and this can cause memory issues.
If you need to restrict the results of the queryset you can replace ".all()" with the appropriate ".filter(conditions)"
from django import db
from myapp import MyModel
# Getting the total of records in the table
total_count = MyModel.objects.all().count()
chunk_size = 1000 # You can change this to any amount you can keep in memory
total_checked = 0
while total_checked < total_count:
    # Querying all the objects and slicing only the part you need to work
    # with at the moment (only that part will be loaded into memory)
    query_set = MyModel.objects.all()[total_checked:total_checked + chunk_size]
    for item in query_set:
        # Do what you need to do with your results
        pass
    total_checked += chunk_size
    # Clearing django's query cache to avoid a memory leak
    db.reset_queries()

SQLAlchemy - select for update example

I'm looking for a complete example of using select for update in SQLAlchemy, but haven't found one by googling. I need to lock a single row and update a column; the following code doesn't work (it blocks forever):
s = table.select(table.c.user=="test",for_update=True)
# Do update or not depending on the row
u = table.update().where(table.c.user=="test")
u.execute(email="foo")
Do I need a commit? How do I do that? As far as I know you need to:
begin transaction
select ... for update
update
commit
If you are using the ORM, try the with_for_update function:
foo = session.query(Foo).filter(Foo.id==1234).with_for_update().one()
# this row is now locked
foo.name = 'bar'
session.add(foo)
session.commit()
# this row is now unlocked
Late answer, but maybe someone will find it useful.
First, you don't need to commit (at least not in between queries, which I'm assuming you are asking about). Your second query hangs indefinitely because you are effectively creating two concurrent connections to the database: the first one obtains a lock on the selected records, and then the second one tries to modify the locked records, so it can't work properly. (By the way, in the example given you are not executing the first query at all, so I'm assuming that in your real tests you did something like s.execute() somewhere.) So, to the point, a working implementation should look more like this:
trans = conn.begin()
s = conn.execute(table.select(table.c.user=="test", for_update=True))
u = conn.execute(table.update().where(table.c.user=="test"), {"email": "foo"})
trans.commit()
Of course in such a simple case there's no reason to do any locking, but I guess it is only an example and you were planning to add some additional logic between those two calls.
Yes, you do need to commit, which you can do by wrapping the statements in an explicit transaction (for example via engine.begin()). Also, the new values are specified in the values(...) method, not passed to execute:
>>> with my_engine.begin() as conn:
...     conn.execute(table.update().
...                  where(table.c.user=="test").
...                  values(email="foo"))
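Putting the pieces together, here is a self-contained sketch of the begin / select-for-update / update / commit sequence in Core style (assuming SQLAlchemy 1.4+; the table, columns and connection URL are illustrative, not from the question):
from sqlalchemy import create_engine, MetaData, Table, Column, String, select

engine = create_engine("postgresql://user:password@localhost/mydb")
metadata = MetaData()
users = Table("users", metadata,
              Column("user", String, primary_key=True),
              Column("email", String))

with engine.begin() as conn:                       # BEGIN; COMMIT on exit, ROLLBACK on error
    row = conn.execute(
        select(users).where(users.c.user == "test").with_for_update()
    ).first()                                      # the matching row is now locked
    if row is not None:
        conn.execute(users.update()
                     .where(users.c.user == "test")
                     .values(email="foo"))
# the lock is released here, when the transaction commits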
