How to speed up insertion into Redis from SQL Query using Python - python

I have a SQL query I execute, and the results come into my Python program in ~500 ms (about 100k rows).
I want to insert this into Redis quickly, but it currently takes ~6 sec, even with pipelining.
pipe = r.pipeline()
for row in q:
    pipe.zincrby(SKEY, row["name"], 1)
pipe.execute()
Is there a way to speed this up?

The problem is that you are inserting a large number of items into a sorted set. The Redis documentation says the time complexity of ZINCRBY is O(log(N)), where N is the number of elements in the sorted set, so the more items the set contains, the longer each increment takes. You should probably rethink the way you use Redis in this case; maybe a sorted set is not the best answer to your use case.
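For example, if you only need per-name counts and can sort on the client when reading the data back, a plain hash avoids the O(log(N)) cost entirely, since HINCRBY is O(1). A minimal sketch (HKEY is a made-up key name; q is the SQL result set from the question):
import redis

r = redis.Redis()     # assumes a local Redis instance, as in the question
HKEY = "name_counts"  # hypothetical hash key holding one counter field per name

pipe = r.pipeline(transaction=False)
for row in q:         # q: the SQL result rows from the question
    pipe.hincrby(HKEY, row["name"], 1)  # O(1) per increment, unlike ZINCRBY's O(log N)
pipe.execute()
Whether this fits depends on whether you actually need sorted-set operations (ranges, rankings) on the server side.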

In general there's no way to speed this up from Redis's perspective, but there are two things you can do:
1. If names repeat themselves, try reducing the number of commands by summing up the counts per name before calling Redis, i.e.:
d = dict()
for row in q:
    name = row["name"]
    d[name] = d.get(name, 0) + 1
Then, if you have recurring names, you'll send fewer commands to Redis.
2. Another thing I would try is to call execute() every 1000 or 5000 commands or so; that way Redis won't block other clients for as long while the pipeline runs, and Python itself will allocate less memory, which might speed things up.
e.g. (combined with the above):
d = dict()
for row in q:
    name = row["name"]
    d[name] = d.get(name, 0) + 1

pipe = r.pipeline()
for i, (k, v) in enumerate(d.items()):
    pipe.zincrby(SKEY, k, v)
    if i > 0 and i % 5000 == 0:
        # flush the pipeline periodically instead of sending everything at once
        pipe.execute()
pipe.execute()

Related

AIO Castle Cavalry - My code is too slow, is there a way I can shorten this?

So I am currently preparing for a competition (Australian Informatics Olympiad) and in the training hub, there is a problem in AIO 2018 intermediate called Castle Cavalry. I finished it:
input = open("cavalryin.txt").read()
output = open("cavalryout.txt", "w")
squad = input.split()
total = squad[0]
squad.remove(squad[0])
squad_sizes = squad.copy()
squad_sizes = list(set(squad))
yn = []
for i in range(len(squad_sizes)):
    n = squad.count(squad_sizes[i])
    if int(squad_sizes[i]) == 1 and int(n) == int(total):
        yn.append(1)
    elif int(n) == int(squad_sizes[i]):
        yn.append(1)
    elif int(n) != int(squad_sizes[i]):
        yn.append(2)
ynn = list(set(yn))
if len(ynn) == 1 and int(ynn[0]) == 1:
    output.write("YES")
else:
    output.write("NO")
output.close()
I submitted this code and I didn't pass, because it was too slow at 1.952 secs. The time limit is 1.000 sec. I wasn't sure how I would shorten this, as to me it looks fine. PLEASE keep in mind I am still learning, and I am only an amateur. I started coding only this year, so if the answer is quite obvious, sorry for wasting your time 😅.
Thank you for helping me out!
One performance issue is calling int() over and over on the same entity, or on things that are already int:
if int(squad_sizes[i]) == 1 and int(n) == int(total):
elif int(n) == int(squad_sizes[i]):
elif int(n) != int(squad_sizes[i]):
if len(ynn) == 1 and int(ynn[0]) == 1:
But the real problem is your code doesn't work. And making it faster won't change that. Consider the input:
4
2
2
2
2
Your code will output "NO" (with missing newline) despite it being a valid configuration. This is due to your collapsing the squad sizes using set() early in your code. You've thrown away vital information and are only really testing a subset of the data. For comparison, here's my complete rewrite that I believe handles the input correctly:
with open("cavalryin.txt") as input_file:
string = input_file.read()
total, *squad_sizes = map(int, string.split())
success = True
while squad_sizes:
squad_size = squad_sizes.pop()
for _ in range(1, squad_size):
try:
squad_sizes.remove(squad_size) # eliminate n - 1 others like me
except ValueError:
success = False
break
else: # no break
continue
break
with open("cavalryout.txt", "w") as output_file:
print("YES" if success else "NO", file=output_file)
Note that I convert all the input to int early on so I don't have to consider that issue again. I don't know whether this will meet AIO's timing constraints.
I can see some things in there that might be inefficient, but the best way to optimize code is to profile it: run it with a profiler and sample data.
You can easily waste time trying to speed up parts that don't need it without having much effect. Read up on the cProfile module in the standard library to see how to do this and interpret the output. A profiling tutorial is probably too long to reproduce here.
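For instance, a minimal sketch, assuming you first wrap the script's logic in a main() function:
import cProfile

# run the whole program under the profiler and print the hot spots,
# slowest cumulative time first
cProfile.run("main()", sort="cumulative")
Alternatively, python -m cProfile -s cumulative yourscript.py profiles the script from the command line without modifying it.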
My suggestions, without profiling:
squad.remove(squad[0])
Removing the start of a big list is slow, because the rest of the list has to be copied as it is shifted down. (Removing the end of the list is faster, because lists are typically backed by arrays that are over-allocated (more slots than elements) anyway, to make .append()s fast, so it only has to decrease the length and can keep the same array.)
It would be better to set this to a dummy value and remove it when you convert it to a set (sets are backed by hash tables, so removals are fast), e.g.
dummy = object()
squad[0] = dummy # len() didn't change. No shifting required.
...
squad_sizes = set(squad)
squad_sizes.remove(dummy) # Fast lookup by hash code.
Since we know these will all be strings, you can just use None instead of a dummy object, but the above technique works even when your list might contain Nones.
squad_sizes = squad.copy()
This line isn't required; it's just doing extra work. The set() already makes a shallow copy.
n = squad.count(squad_sizes[i])
This line might be the real bottleneck. It's effectively a loop inside a loop, so it has to scan the whole list on each iteration of the outer loop. Consider using collections.Counter for this task instead: you generate the count table once, outside the loop, and then just look up the number for each string.
You can also avoid generating the set altogether if you do this. Just use the Counter object's keys for your set.
Another point unrelated to performance. It's unpythonic to use indexes like [i] when you don't need them. A for loop can get elements from an iterable and assign them to variables in one step:
from collections import Counter
...
count_table = Counter(squad)
for squad_size, n in count_table.items():
    ...
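Putting those pieces together, here is a sketch of the whole check using Counter and the same requirement the rewrite above enforces (every preferred size k must account for a multiple of k knights). This is only a sketch and hasn't been run against the AIO judge; the file names are the ones from the question:
from collections import Counter

with open("cavalryin.txt") as f:
    total, *squad = map(int, f.read().split())  # total isn't needed for the check itself

count_table = Counter(squad)
# knights preferring size k can only be arranged if they split
# evenly into squads of exactly k members
ok = all(n % squad_size == 0 for squad_size, n in count_table.items())

with open("cavalryout.txt", "w") as f:
    f.write("YES" if ok else "NO")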
You can collect all occurrences of each preferred number in a dictionary.
Then test whether the number of knights with a given preferred number is divisible by that number.
with open('cavalryin.txt', 'r') as f:
    lines = f.readlines()

# convert to int
list_int = [int(a) for a in lines]

# initialise counting dictionary: key: preferred number,
# item: empty list to collect all knights with that preferred number
collect_dict = {a: [] for a in range(1, 1 + max(list_int[1:]))}
print(collect_dict)

# loop through the list, ignoring the first entry
for a in list_int[1:]:
    collect_dict[a].append(a)

# initialise output
out = 'YES'
for key, item in collect_dict.items():
    # check that the number of knights preferring this number is divisible
    # by that number
    if item:  # if the list has entries:
        if (len(item) % key) > 0:
            out = 'NO'
            break

with open('cavalryout.txt', 'w') as f:
    f.write(out)

Optimising a Python script

I've been trying to complete the below task in Python:
http://codeforces.com/problemset/problem/4/C
I created a simple script for it, as can be seen below, but it returns a runtime error on the 7th test. I believe this is because the code takes too long, so I need help optimising it. I have looked at the map and filter functions and tried implementing them, without success.
a = int(input())
entered_usernames = []
n = 0
while n < a:
    y = input()
    entered_usernames.append(y)
    n += 1

valid_usernames = []
for i in entered_usernames:
    if i not in valid_usernames:
        valid_usernames.append(i)
        print('OK')
    else:
        count = 1
        while i + str(count) in valid_usernames:
            count += 1
        valid_usernames.append(i + str(count))
        print(i + str(count))
You can try changing valid_usernames to a set instead of a list.
For a list list_a, the operation x in list_a takes (on average) linear time.
For a set set_a, the operation x in set_a takes (on average) constant time.
(source: https://wiki.python.org/moin/TimeComplexity)
This simple change could improve runtime a bit.
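For example, a sketch of just that change, with the rest of the logic left as it is:
valid_usernames = set()
for i in entered_usernames:
    if i not in valid_usernames:  # O(1) membership test on a set
        valid_usernames.add(i)
        print('OK')
    else:
        count = 1
        while i + str(count) in valid_usernames:
            count += 1
        valid_usernames.add(i + str(count))
        print(i + str(count))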
What also strikes me as potentially very slow is this fragment:
while i+str(count) in valid_usernames:
count+=1
However, if you want to improve this, you need to think about using a completely different data structure.
Why don't you use a lookup dict with a counter and solve this in O(N) time?
total = int(input())  # get the first input (total usernames)
database = {}  # our 'database' / lookup dict
candidates = [input() for _ in range(total)]  # pick usernames from the input

for candidate in candidates:  # loop through each candidate
    if candidate in database:  # already used, print with a counter
        print(candidate + str(database[candidate]))
        database[candidate] += 1  # increase the counter
    else:  # the candidate doesn't exist in the 'database'...
        print("OK")
        database[candidate] = 1  # initialize counter for the next time
Why don't you try
valid_usernames.append(i + str(valid_usernames.count(i)))
print(i + str(valid_usernames.count(i)))

Get random record set with Django, what is affecting the performance?

It is said that
Record.objects.order_by('?')[:n]
has performance issues, and the recommendation is to do something like this instead (here):
sample = random.sample(xrange(Record.objects.count()),n)
result = [Record.objects.all()[i] for i in sample]
Given that, why not do it directly like this:
result = random.sample(Record.objects.all(),n)
I have no idea what Django is actually doing in the background when this code runs. Please tell me: is the one-line version at the end more efficient or not? Why?
================ Edit 2013-5-12 23:21 UTC+8 ================
I spent my whole afternoon doing this test.
My computer : CPU Intel i5-3210M RAM 8G
System : Win8.1 pro x64 Wampserver2.4-x64 (with apache2.4.4 mysql5.6.12 php5.4.12) Python2.7.5 Django1.4.6
What I did was:
Create an app.
Build a simple model with an index and a CharField content, then run syncdb.
Create 3 views that each get a random set of 20 records in the 3 different ways above, and output the time used.
Modify settings.py so that Django logs its SQL queries to the console.
Insert rows into the table until the row count reaches what I want.
Visit the 3 views and note the SQL query statements, the SQL time, and the total time.
Repeat steps 5 and 6 with different numbers of rows in the table (10k, 200k, 1m, 5m).
This is views.py:
import datetime
import random

from django.http import HttpResponse

from .models import Record  # adjust the import path to your app


def test1(request):
    start = datetime.datetime.now()
    result = Record.objects.order_by('?')[:20]
    l = list(result)  # QuerySets are lazy; force the QuerySet into a list
    end = datetime.datetime.now()
    return HttpResponse("time: <br/> %s" % ((end - start).microseconds / 1000))


def test2(request):
    start = datetime.datetime.now()
    sample = random.sample(xrange(Record.objects.count()), 20)
    result = [Record.objects.all()[i] for i in sample]
    l = list(result)
    end = datetime.datetime.now()
    return HttpResponse("time: <br/> %s" % (end - start))


def test3(request):
    start = datetime.datetime.now()
    result = random.sample(Record.objects.all(), 20)
    l = list(result)
    end = datetime.datetime.now()
    return HttpResponse("time: <br/> %s" % (end - start))
As @Yeo said, result = random.sample(Record.objects.all(), n) is crap. I won't talk about that.
But interestingly, Record.objects.order_by('?')[:n] always performs better than the others, especially when the table has fewer than 1m rows. Here is the data and the charts:
So, what happened?
In the last test, with 5,195,536 rows in the target table, result = random.sample(Record.objects.all(), n) actually did this:
(22.275) SELECT `randomrecords_record`.`id`, `randomrecords_record`.`content`
FROM `randomrecords_record` ORDER BY RAND() LIMIT 20; args=()
Everyone is right, and it used 22 seconds. And
sample = random.sample(xrange(Record.objects.count()),n)
result = [Record.objects.all()[i] for i in sample]
actually did this:
(1.393) SELECT COUNT(*) FROM `randomrecords_record`; args=()
(3.201) SELECT `randomrecords_record`.`id`, `randomrecords_record`.`content`
FROM `randomrecords_record` LIMIT 1 OFFSET 4997880; args=()
...20 lines
As you can see, getting one row costs 3 seconds, and I find that the larger the offset, the more time is needed.
But... why?
My thinking is:
if there is some way to speed up the large-offset query, then
sample = random.sample(xrange(Record.objects.count()),n)
result = [Record.objects.all()[i] for i in sample]
should be the best, except(!) when the table has fewer than 1m rows.
The problem with .order_by('?') is that under the hood it does ORDER BY RAND() (or equivalent, depending on the DB), which basically has to create a random number for each row and do the sorting. This is a heavy operation and requires lots of time.
On the other hand, doing Record.objects.all() forces your app to download all objects, and then you choose from them. It is not that heavy on the database side (it will be faster than sorting), but it is heavy on network and memory. Thus it can kill your performance as well.
So that's the tradeoff.
Now this is a lot better:
sample = random.sample(xrange(Record.objects.count()),n)
result = [Record.objects.all()[i] for i in sample]
simply because it avoids all the problems mentioned above (note that Record.objects.all()[i] gets translated to SELECT * FROM table LIMIT 1 OFFSET i, depending on DB).
However, it may still be inefficient, since .count() might be slow (as usual: it depends on the DB).
Record.objects.count() gets translated into a very light SQL query:
SELECT COUNT(*) FROM TABLE
Record.objects.all()[0] is also translated into a very light SQL query:
SELECT * FROM TABLE LIMIT 1
With Record.objects.all(), the results usually get sliced to improve performance:
SELECT * FROM table LIMIT 20; // or something similar
list(Record.objects.all()) will query all the data and put it into a list data structure:
SELECT * FROM TABLE
Thus, any time you convert a QuerySet into a list, that's where the expensive part happens.
In your example, random.sample() will convert the QuerySet into a list (if I'm not wrong).
Thus, when you do result = random.sample(Record.objects.all(), n), it will evaluate the full QuerySet, convert it into a list, and then randomly pick from that list.
Just imagine you have millions of records. Are you going to query them all and store them in a list with millions of elements, or would you rather query them one by one?
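For what it's worth, a pattern that avoids both the full ORDER BY RAND() sort and the 20 separate OFFSET queries is to sample primary keys and fetch them in one query. This is only a sketch; it assumes an auto-increment integer id with relatively few gaps (deletions thin the results), which may not hold for your table:
import random

def random_records(n=20):
    # highest id currently in the table (one cheap indexed query)
    max_id = Record.objects.order_by('-id').values_list('id', flat=True)[0]
    # oversample ids to compensate for gaps left by deleted rows
    candidate_ids = random.sample(xrange(1, max_id + 1), min(n * 3, max_id))
    rows = list(Record.objects.filter(id__in=candidate_ids))
    return random.sample(rows, min(n, len(rows)))
This sends exactly two SELECTs regardless of table size, at the cost of a slightly uneven distribution when there are many gaps.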

Python: How to speed up creating of objects?

I'm creating objects derived from a rather large txt file. My code is working properly but takes a long time to run. This is because the elements I'm looking for in the first place are not ordered and not (necessarily) unique. For example I am looking for a digit-code that might be used twice in the file but could be in the first and the last row. My idea was to check how often a certain code is used...
counter=collections.Counter([l[3] for l in self.body])
...and then loop through the counter. Advantage: if a code is only used once, you don't have to iterate over the whole file. However, you are stuck with a lot of iterations, which makes the process really slow.
So my question really is: how can I improve my code? Another idea, of course, is to order the data first, but that could take quite a long time as well.
The crucial part is this method:
def get_pc(self):
    counter = collections.Counter([l[3] for l in self.body])
    # This returns something like this: {'187': 2, '199': 1, ...}
    pcode = []
    # loop through entries of counter
    for k, v in counter.iteritems():
        i = 0
        # find post code in body
        for l in self.body:
            if i == v:
                break
            # find first appearance of key
            if l[3] == k:
                # first encounter...
                if i == 0:
                    # ...so create object
                    self.pc = CodeCana(k, l[2])
                    pcode.append(self.pc)
                i += 1
                # make attributes
                self.pc.attr((l[0], l[1]), l[4])
                if v <= 1:
                    break
    return pcode
I hope the code explains the problem sufficiently. If not, let me know and I will expand the provided information.
You are looping over body way too many times. Collapse this into one loop, and track the CodeCana items in a dictionary instead:
def get_pc(self):
    pcs = dict()
    pcode = []
    for l in self.body:
        pc = pcs.get(l[3])
        if pc is None:
            pc = pcs[l[3]] = CodeCana(l[3], l[2])
            pcode.append(pc)
        pc.attr((l[0], l[1]), l[4])
    return pcode
Counting all items first then trying to limit looping over body by that many times while still looping over all the different types of items defeats the purpose somewhat...
You may want to consider giving the various indices in l names. You can use tuple unpacking:
for foo, bar, baz, egg, ham in self.body:
    pc = pcs.get(egg)
    if pc is None:
        pc = pcs[egg] = CodeCana(egg, baz)
        pcode.append(pc)
    pc.attr((foo, bar), ham)
but building body out of a namedtuple-based class would help in code documentation and debugging even more.
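For example, a sketch of that idea; the field names here are invented, since the question doesn't say what the five columns actually mean:
from collections import namedtuple

# hypothetical names for the five columns of each parsed line
Row = namedtuple("Row", ["x", "y", "city", "code", "extra"])

body = [Row(*cols) for cols in raw_rows]  # raw_rows: the split lines of the txt file

pcs = dict()
pcode = []
for row in body:
    pc = pcs.get(row.code)
    if pc is None:
        pc = pcs[row.code] = CodeCana(row.code, row.city)
        pcode.append(pc)
    pc.attr((row.x, row.y), row.extra)
Now every access is named, which makes the intent of each field obvious when reading or debugging the loop.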

How to get something random in datastore (AppEngine)?

Currently I'm using something like this:
images = Image.all()
count = images.count()
random_numb = random.randrange(1, count)
image = Image.get_by_id(random_numb)
But it turns out that the ids in the datastore on AppEngine don't start from 1.
I have two images in datastore and their ids are 6001 and 7001.
Is there a better way to retrieve random images?
The datastore is distributed, so IDs are non-sequential: two datastore nodes need to be able to generate an ID at the same time without causing a conflict.
To get a random entity, you can attach a random float between 0 and 1 to each entity on create. Then to query, do something like this:
rand_num = random.random()
entity = MyModel.all().order('rand_num').filter('rand_num >=', rand_num).get()
if entity is None:
    entity = MyModel.all().order('rand_num').get()
Edit: Updated fall-through case per Nick's suggestion.
Another solution, if you don't want to add an additional property: keep a set of keys in memory.
import random
from google.appengine.ext import db  # needed for db.get() below

# Get all the keys, not the Entities
q = ItemUser.all(keys_only=True).filter('is_active =', True)
item_keys = q.fetch(2000)
# Get a random set of those keys, in this case 20
random_keys = random.sample(item_keys, 20)
# Get those 20 Entities
items = db.get(random_keys)
The above code illustrates the basic method for getting only keys and then creating a random set with which to do a batch get. You could keep that set of keys in memory, add to it as you create new ItemUser Entities, and then have a method that returns n random Entities. You'll have to implement some overhead to manage the memcached keys. I like this solution better if you're performing the query for random elements often (I assume using a batch get for n Entities is more efficient than a query for n Entities).
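A sketch of that bookkeeping (the memcache key name and helper functions are made up, and it assumes the key list fits in a single memcache value):
import random
from google.appengine.api import memcache
from google.appengine.ext import db

KEYS_CACHE_KEY = "active_itemuser_keys"  # hypothetical memcache key

def get_active_keys():
    keys = memcache.get(KEYS_CACHE_KEY)
    if keys is None:
        # cache miss: rebuild the key list from the datastore
        q = ItemUser.all(keys_only=True).filter('is_active =', True)
        keys = q.fetch(2000)
        memcache.set(KEYS_CACHE_KEY, keys)
    return keys

def register_key(new_key):
    # call this after creating a new active ItemUser
    keys = get_active_keys()
    keys.append(new_key)
    memcache.set(KEYS_CACHE_KEY, keys)

def random_items(n=20):
    return db.get(random.sample(get_active_keys(), n))
Note this ignores concurrent writers and cache eviction, which is exactly the management overhead mentioned above.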
I think Drew Sears's answer above (attach a random float to each entity on create) has a potential problem: every item doesn't have an equal chance of getting picked. For example, if there are only 2 entities, and one gets a rand_num of 0.2499, and the other gets 0.25, the 0.25 one will get picked almost all the time. This might or might not matter to your application. You could fix this by changing the rand_num of an entity every time it is selected, but that means each read also requires a write.
And pix's answer will always select the first key.
Here's the best general-purpose solution I could come up with:
num_images = Image.all().count()
offset = random.randrange(0, num_images)
image = Image.all().fetch(1, offset)[0]
No additional properties needed, but the downside is that count() and fetch() both have performance implications if the number of Images is large.
Another (less efficient) method, which requires no setup:
query = MyModel.all(keys_only=True)
# query.filter("...")
selected_key = None
n = 0
for key in query:
    # reservoir sampling with a reservoir of one:
    # each key ends up selected with probability 1/n
    if random.randint(0, n) == 0:
        selected_key = key
    n += 1
# just in case the query is empty
if selected_key is None:
    entry = None
else:
    entry = MyModel.get(selected_key)
