TLDR:
How do I efficiently determine a sequence of queries that give back every result from a directory of names, given that responses to each query are limited to a small fraction of the number of entries in the whole directory?
Goal
I have been tasked with scraping all information from a number of university directories. These directories consist of faculty and staff members, and each person has information that I am interested in collecting (name, email address, title, department, etc.). For most directories, my goal is to get the URL which corresponds to each member of the directory, so that information about that person can be gathered individually. Thus, for my purposes, getting a list of every name in the directory is sufficient.
For some of these directories, I am required to make a search query that then returns some results (others display all results at once). Usually, I am given the option to search by one (or several) fields, including first name, last name, and department, among others. Unfortunately, queries often have a maximum results limit which prevents me from simply searching A, B, C, etc.
How are queries interpreted?
Across the board, all queries are case-insensitive. However, directories interpret search queries in different ways; I have seen three interpretations:
Assume the following toy directory: ["Abby", "Abraham", "Alb", "Babbage"]
1. Implicit following wildcard: results that start with the query are returned
In this case, searching "ab" would return "Abby" and "Abraham" but not "Babbage".
2. Implicit double wildcard: results that contain the query are returned
Here, searching "ab" would return "Abby", "Abraham", and "Babbage".
3. Fuzzy matching: results that contain or are close to the query are returned
In this case, searching "ab" would return all four names.
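To make the three interpretations concrete, here is a small illustrative sketch (toy matchers of my own, not any real directory's code; the fuzzy matcher is just a crude edit-ratio stand-in, so real directories may behave differently):
from difflib import SequenceMatcher

# Illustrative sketch only: toy matchers for the three interpretations.
directory = ["Abby", "Abraham", "Alb", "Babbage"]

def following_wildcard(query, names):
    # Interpretation 1: implicit trailing wildcard (prefix match)
    return [n for n in names if n.lower().startswith(query.lower())]

def double_wildcard(query, names):
    # Interpretation 2: implicit double wildcard (substring match)
    return [n for n in names if query.lower() in n.lower()]

def fuzzy(query, names, cutoff=0.5):
    # Interpretation 3: crude stand-in for fuzzy matching
    return [n for n in names
            if query.lower() in n.lower()
            or SequenceMatcher(None, query.lower(), n.lower()).ratio() >= cutoff]

print(following_wildcard("ab", directory))  # ['Abby', 'Abraham']
print(double_wildcard("ab", directory))     # ['Abby', 'Abraham', 'Babbage']
print(fuzzy("ab", directory))               # all four names with this cutoff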
Algorithm
From these interpretations, I designed an algorithm which assumes its queries are treated with implicit following wildcards. I chose this interpretation because, when the same queries are interpreted as having an implicit double wildcard or with fuzzy matching, the results will be a superset of those returned under the assumed interpretation. Thus, the same algorithm can be applied to all three situations.
Potential caveat: with the double wildcard or fuzzy matching interpretations, there will be many more results for each query, requiring many more queries to cover the whole directory at the same number of maximum results.
Algorithm specification
The algorithm proceeds as follows:
1. Set the last name query to be a.
2. Make the last name query.
2a. If not over the results limit, save the results to the results set and increment the last character of the last name query (e.g. go from a to b or from apple to applf), then return to step 2 (note that this could induce a carrying process, such as from azzz to b). If this increment overflows, the search is complete and you should skip to step 4. If over the results limit, continue to step 2b.
2b. Set the first name query to be a.
2c. Query the first name and the last name at the same time.
2d. If not over the results limit, save the results to the results set, increment the last character of the first name query, and return to step 2c. If this increment overflows (e.g. if a first name query of z is under the results limit), set the first name query to be blank and continue to step 3. If over the results limit, add an a to the end of the first name query and return to step 2c.
3. Add an a to the end of the last name query and return to step 2.
4. Return the set of results.
Python implementation
Here is the above algorithm implemented in Python (get_random_names() is a placeholder that builds the simulated directory as a collection of (first, last) tuples). I included helper functions such as make_query(), increment(), and append_a().
import string

chars = string.ascii_lowercase

# get_random_names() is a placeholder helper that returns the simulated
# directory as a collection of (first, last) name tuples.
names = get_random_names(n=10000)
results_limit = 25

def make_query(first="", last=""):
    # Simulates the directory's search with implicit following wildcards.
    # Returns (results, over_limit).
    print("Querying for the following:")
    print("first:", first)
    print("last:", last)
    results = set()
    for n in names:
        f, l = n
        if f.lower().startswith(first) and l.lower().startswith(last):
            results.add(n)
    if len(results) > results_limit:
        print("Too many results")
        print()
        return set(), True
    else:
        print("Success! This gave " + str(len(results)) + " results")
        print()
        return results, False

def increment(q):
    # Increments the last character of the query, carrying as needed.
    # Returns (new_query, overflowed).
    ql = [chars.index(c) for c in q]
    while ql[-1] == len(chars) - 1:
        del ql[-1]
        if len(ql) == 0:
            return "", True
    ql[-1] += 1
    return "".join([chars[i] for i in ql]), False

def append_a(q):
    # Appends an 'a' to the end of the query.
    ql = [chars.index(c) for c in q]
    ql.append(0)
    return "".join([chars[i] for i in ql])

def search_directory(field="last", fixed_last=None):
    all_results = set()
    query = "a"
    num_queries = 0
    while True:
        if field == "last":
            query_results, over_limit = make_query(last=query)
            num_queries += 1
        elif field == "first":
            query_results, over_limit = make_query(first=query, last=fixed_last)
            num_queries += 1
        if not over_limit:
            all_results = all_results.union(query_results)
            query, is_finished = increment(query)
            if is_finished:
                return all_results, num_queries
            continue
        elif over_limit and field == "last":
            # Too many results for this last name prefix: sweep first names.
            first_name_results, first_num_queries = search_directory(field="first", fixed_last=query)
            num_queries += first_num_queries
            all_results = all_results.union(first_name_results)
        query = append_a(query)
results, num_queries = search_directory()
print(results)
print("Number of results:", len(results))
print("Number of entries in directory:", len(set(names)))
print("Accuracy:", str(len(results)/len(set(names))))
print("Number of queries:", num_queries)
print("Missed names:")
print(set(names) - set(results))
Search example
To help people understand this algorithm better, I am providing an example sequence of queries and responses. For brevity, assume the directory consists of the following names only (in (first, last) format):
[("aa", "bac"), ("aa", "bba"), ("aa", "aaa"), ("ab", "bc"), ("b", "bab"), ("ccc", "a")]
Additionally, the results limit will be two (searches can have at most two results), and, to keep the example short, assume the alphabet contains only the letters a, b, and c. Finally, assume that our queries will be interpreted as having wildcards following them. Here are the queries the algorithm performs:
#    Last name query    First name query    Response
1    a                  -                   Under results limit
2    b                  -                   Over results limit
3    b                  a                   Over results limit
4    b                  aa                  Under results limit
5    b                  ab                  Under results limit
6    b                  ac                  Under results limit
7    b                  b                   Under results limit
8    b                  c                   Under results limit
9    ba                 -                   Under results limit
10   bb                 -                   Under results limit
11   bc                 -                   Under results limit
12   c                  -                   Under results limit
And the results:
{('aa', 'bba'), ('b', 'bab'), ('aa', 'bac'), ('aa', 'aaa'), ('ab', 'bc'), ('ccc', 'a')}
Issues
With these 12 queries, I was able to get every name in the directory. However, I have concerns about the efficiency of the algorithm. I tested it on random subsets of names from the 2015 Facebook leak, and I was able to achieve 100% completeness on databases of thousands of names. From what I can tell, it took as many as 18,000 queries to retrieve a database of 9,000 names and 240,000 queries to retrieve a database of 90,000 names.
This is not a desirable level of performance, given that many of the directories I need to run the algorithm on have on the order of 10,000 entries, and each query could take as much as a second or two. More problematically, when adapting this algorithm to use the double wildcard interpretation, it takes as many as 280,000 queries to recover an 8,000 entry database, which is clearly too many.
Is there a more efficient way for me to achieve full coverage, both in the case of the following wildcard and the double wildcard interpretation?
Problem restatement
How do I efficiently determine a sequence of queries that give back every result from a directory of names, given that responses to each query are limited to a small fraction of the number of entries in the whole directory?
Related
So for this problem I had to create a program that takes in two arguments: a CSV database like this:
name,AGATC,AATG,TATC
Alice,2,8,3
Bob,4,1,5
Charlie,3,2,5
And a DNA sequence like this:
TAAAAGGTGAGTTAAATAGAATAGGTTAAAATTAAAGGAGATCAGATCAGATCAGATCTATCTATCTATCTATCTATCAGAAAAGAGTAAATAGTTAAAGAGTAAGATATTGAATTAATGGAAAATATTGTTGGGGAAAGGAGGGATAGAAGG
My program works by first getting the "Short Tandem Repeat" (STR) headers from the database (AGATC, etc.), then counting the highest number of times each STR repeats consecutively within the sequence. Finally, it compares these counted values to the values of each row in the database, printing out a name if a match is found, or "No match" otherwise.
The program works for sure, but it is ridiculously slow whenever run with the larger database provided, to the point where the terminal pauses for an entire minute before returning any output. Unfortunately, this is causing the 'check50' marking system to time out and return a negative result when testing with this large database.
I'm presuming the slowdown is caused by the nested loops within the 'STR_count' function:
def STR_count(sequence, seq_len, STR_array, STR_array_len):
    # Creates a list to store max recurrence values for each STR
    STR_count_values = [0] * STR_array_len
    # Temp value to store current count of STR recurrence
    temp_value = 0
    # Iterates over each STR in STR_array
    for i in range(STR_array_len):
        STR_len = len(STR_array[i])
        # Iterates over each sequence element
        for j in range(seq_len):
            # Ensures it's still physically possible for STR to be present in sequence
            while (seq_len - j >= STR_len):
                # Gets sequence substring of length STR_len, starting from jth element
                sub = sequence[j:(j + (STR_len))]
                # Compares current substring to current STR
                if (sub == STR_array[i]):
                    temp_value += 1
                    j += STR_len
                else:
                    # Ensures current STR_count_value is highest
                    if (temp_value > STR_count_values[i]):
                        STR_count_values[i] = temp_value
                    # Resets temp_value to break count, and pushes j forward by 1
                    temp_value = 0
                    j += 1
        i += 1
    return STR_count_values
And the 'DNA_match' function:
# Searches database file for DNA matches
def DNA_match(STR_values, arg_database, STR_array_len):
    with open(arg_database, 'r') as csv_database:
        database = csv.reader(csv_database)
        name_array = [] * (STR_array_len + 1)
        next(database)
        # Iterates over one row of database at a time
        for row in database:
            name_array.clear()
            # Copies entire row into name_array list
            for column in row:
                name_array.append(column)
            # Converts name_array number strings to actual ints
            for i in range(STR_array_len):
                name_array[i + 1] = int(name_array[i + 1])
            # Checks if a row's STR values match the sequence's values, prints the row name if match is found
            match = 0
            for i in range(0, STR_array_len, + 1):
                if (name_array[i + 1] == STR_values[i]):
                    match += 1
            if (match == STR_array_len):
                print(name_array[0])
                exit()
        print("No match")
        exit()
However, I'm new to Python, and haven't really had to consider speed before, so I'm not sure how to improve upon this.
I'm not particularly looking for people to do my work for me, so I'm happy for any suggestions to be as vague as possible. And honestly, I'll value any feedback, including stylistic advice, as I can only imagine how disgusting this code looks to those more experienced.
Here's a link to the full program, if helpful.
Thanks :) x
Thanks for providing a link to the entire program. It seems needlessly complex, but I'd say it's just a lack of knowing what features are available to you. I think you've already identified the part of your code that's causing the slowness - I haven't profiled it or anything, but my first impulse would also be the three nested loops in STR_count.
Here's how I would write it, taking advantage of the Python standard library. Every entry in the database corresponds to one person, so that's what I'm calling them. people is a list of dictionaries, where each dictionary represents one line in the database. We get this for free by using csv.DictReader.
To find the matches in the sequence, for every short tandem repeat in the database, we create a regex pattern (the current short tandem repeat, repeated one or more times). If there is a match in the sequence, the total number of repetitions is equal to the length of the match divided by the length of the current tandem repeat. For example, if AGATCAGATCAGATC is present in the sequence, and the current tandem repeat is AGATC, then the number of repetitions will be len("AGATCAGATCAGATC") // len("AGATC") which is 15 // 5, which is 3.
count is just a dictionary that maps short tandem repeats to their corresponding number of repetitions in the sequence. Finally, we search for a person whose short tandem repeat counts match those of count exactly, and print their name. If no such person exists, we print "No match".
def main():
    import argparse
    from csv import DictReader
    import re

    parser = argparse.ArgumentParser()
    parser.add_argument("database_filename")
    parser.add_argument("sequence_filename")
    args = parser.parse_args()

    with open(args.database_filename, "r") as file:
        reader = DictReader(file)
        short_tandem_repeats = reader.fieldnames[1:]
        people = list(reader)

    with open(args.sequence_filename, "r") as file:
        sequence = file.read().strip()

    count = dict(zip(short_tandem_repeats, [0] * len(short_tandem_repeats)))
    for short_tandem_repeat in short_tandem_repeats:
        pattern = f"({short_tandem_repeat}){{1,}}"
        match = re.search(pattern, sequence)
        if match is None:
            continue
        count[short_tandem_repeat] = len(match.group()) // len(short_tandem_repeat)

    try:
        person = next(person for person in people if all(int(person[k]) == count[k] for k in short_tandem_repeats))
        print(person["name"])
    except StopIteration:
        print("No match")

    return 0

if __name__ == "__main__":
    import sys
    sys.exit(main())
s = Search(using=client, index=set_index) \
    .source(['metadata.Filename']) \
    .query('match', Filename=date)
total = s.count()
return total
I want to find the total number of instances where '20180511' appears in metadata.Filename in a particular index.
This query is returning a higher number of hits than I would have expected. The data format in metadata.Filename is GEOSCATCAT20180507_12+20180511_0900.V01.nc4. My date variable is in the format '20180511'.
I think the problem is that match queries are scored, so they might return a hit even if it's not an exact match. I was wondering if you had any insight regarding this issue.
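One way to sanity-check this is to ask Elasticsearch how that field tokenizes a sample filename, since the match query works against those tokens rather than the raw string. Below is a hedged diagnostic sketch only, assuming client is an elasticsearch-py Elasticsearch instance and that set_index and metadata.Filename above are correct:
# Diagnostic sketch: inspect how a sample filename is tokenized for the field.
# Assumes `client` is an elasticsearch-py client and `set_index` is the index
# queried above. The match query matches against these tokens, which is why it
# can return hits that are not exact substring matches.
analysis = client.indices.analyze(
    index=set_index,
    body={
        "field": "metadata.Filename",
        "text": "GEOSCATCAT20180507_12+20180511_0900.V01.nc4",
    },
)
print([token["token"] for token in analysis["tokens"]])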
It said that
Record.objects.order_by('?')[:n]
has performance issues, and recommends doing something like this instead (here):
sample = random.sample(xrange(Record.objects.count()),n)
result = [Record.objects.all()[i] for i in sample]
Given that, why not do it directly like this:
result = random.sample(Record.objects.all(),n)
I have no idea what Django is actually doing in the background when this code runs. Can you tell me whether the one-line version at the end is more efficient or not, and why?
================ Edit 2013-5-12 23:21 UTC+8 ================
I spent my whole afternoon doing this test.
My computer: Intel i5-3210M CPU, 8 GB RAM
System: Win8.1 Pro x64, WampServer 2.4 x64 (with Apache 2.4.4, MySQL 5.6.12, PHP 5.4.12), Python 2.7.5, Django 1.4.6
What I did was:
1. Create an app.
2. Build a simple model with an index and a CharField content, then run syncdb.
3. Create 3 views that get a random set of 20 records in the 3 different ways above, and output the time used.
4. Modify settings.py so that Django outputs the SQL log to the console.
5. Insert rows into the table until the number of rows is what I want.
6. Visit the 3 views and note the SQL query statement, the SQL time, and the total time.
7. Repeat steps 5 and 6 with different numbers of rows in the table (10k, 200k, 1m, 5m).
This is views.py:
import datetime
import random

from django.http import HttpResponse

# (Record is the simple model described above, imported from the app's models.)

def test1(request):
    start = datetime.datetime.now()
    result = Record.objects.order_by('?')[:20]
    l = list(result)  # QuerySets are lazy; force evaluation by converting to a list
    end = datetime.datetime.now()
    return HttpResponse("time: <br/> %s" % ((end - start).microseconds / 1000))

def test2(request):
    start = datetime.datetime.now()
    sample = random.sample(xrange(Record.objects.count()), 20)
    result = [Record.objects.all()[i] for i in sample]
    l = list(result)
    end = datetime.datetime.now()
    return HttpResponse("time: <br/> %s" % (end - start))

def test3(request):
    start = datetime.datetime.now()
    result = random.sample(Record.objects.all(), 20)
    l = list(result)
    end = datetime.datetime.now()
    return HttpResponse("time: <br/> %s" % (end - start))
As @Yeo said, result = random.sample(Record.objects.all(), n) is crap. I won't talk about that.
But interestingly, Record.objects.order_by('?')[:n] always performs better than the others, especially when the table is smaller than 1m rows. Here is the data:
and the charts:
So, what's happened?
In the last test, with 5,195,536 rows in the target table, result = Record.objects.order_by('?')[:20] actually did this:
(22.275) SELECT `randomrecords_record`.`id`, `randomrecords_record`.`content`
FROM `randomrecords_record` ORDER BY RAND() LIMIT 20; args=()
Everyone is right: it used 22 seconds. And
sample = random.sample(xrange(Record.objects.count()),n)
result = [Record.objects.all()[i] for i in sample]
actually did this:
(1.393) SELECT COUNT(*) FROM `randomrecords_record`; args=()
(3.201) SELECT `randomrecords_record`.`id`, `randomrecords_record`.`content`
FROM `randomrecords_record` LIMIT 1 OFFSET 4997880; args=()
...20 lines
As you can see, getting one row cost 3 seconds, and I found that the larger the index (offset), the more time is needed.
But... why?
My thinking is:
If there is some way to speed up the large-offset query,
sample = random.sample(xrange(Record.objects.count()),n)
result = [Record.objects.all()[i] for i in sample]
should be the best approach. Except(!) when the table is smaller than 1m rows.
The problem with .order_by('?') is that under the hood it does ORDER BY RAND() (or equivalent, depending on the DB), which basically has to create a random number for each row and then sort. This is a heavy operation and requires lots of time.
On the other hand, doing Record.objects.all() forces your app to download all objects, and then you choose from them. It is not that heavy on the database side (it will be faster than sorting), but it is heavy on network and memory. Thus it can kill your performance as well.
So that's the tradeoff.
Now this is a lot better:
sample = random.sample(xrange(Record.objects.count()),n)
result = [Record.objects.all()[i] for i in sample]
simply because it avoids all the problems mentioned above (note that Record.objects.all()[i] gets translated to SELECT * FROM table LIMIT 1 OFFSET i, depending on DB).
However it may still be inefficient since .count might be slow (as usual: depends on DB).
Record.objects.count() gets translated into very light SQL Query.
SELECT COUNT(*) FROM TABLE
Record.objects.all()[0] is also translated into a very light SQL Query.
SELECT * FROM TABLE LIMIT 1
Record.objects.all(): usually the results get sliced to improve performance
SELECT * FROM table LIMIT 20; // or something similar
list(Record.objects.all()) will query all the data and put it into a list data structure.
SELECT * FROM TABLE
Thus, any time you convert a QuerySet into a list, that's where the expensive work happens.
In your example, random.sample() will convert it into a list (if I'm not wrong).
Thus, when you do result = random.sample(Record.objects.all(), n), it will evaluate the full QuerySet, convert it into a list, and then randomly pick from that list.
Just imagine if you have millions of records. Are you going to query them all and store them in a list with millions of elements, or would you rather query them one by one?
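For completeness, here is a hedged sketch (not from the original discussion) of a variant of the index-sampling idea: pull only the primary keys in one query, sample them in Python, then fetch the chosen rows with a single id__in filter instead of n separate LIMIT 1 OFFSET i queries.
import random

# Hedged sketch: sample actual primary keys rather than positional offsets.
# One key-only query, then one query for the sampled rows.
ids = list(Record.objects.values_list('id', flat=True))
picked = random.sample(ids, 20)
result = list(Record.objects.filter(id__in=picked))
Note that listing every id still scans the table, so this mainly trades the per-row OFFSET cost for one larger key-only query.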
I have ListPropertys in an entity, each containing two time objects that represent a business's opening and closing times for a day of the week:
mon_hours = db.ListProperty(datetime.time)
tue_hours = db.ListProperty(datetime.time)
wed_hours = db.ListProperty(datetime.time)
thu_hours = db.ListProperty(datetime.time)
fri_hours = db.ListProperty(datetime.time)
sat_hours = db.ListProperty(datetime.time)
sun_hours = db.ListProperty(datetime.time)
When I query this entity using the current time AND chain the filters to properly return only records where the list has a time both greater than and less than the current time, it fails with 0 results:
now = datetime.datetime.now()
q = Place.all()
q.filter('mon_hours <=', now.time()).filter('mon_hours >=', now.time())
However, when I remove one of the filters, it returns results, albeit the wrong ones:
now = datetime.datetime.now()
q = Place.all()
q.filter('mon_hours <=', now.time())
When I manually set the minutes to 00, it works for some reason:
q = Place.all()
q.filter('mon_hours <=', datetime.datetime(1970,1,1,10,00).time()).filter('mon_hours >=', datetime.datetime(1970,1,1,10,00).time())
This last query returns the desired results, but the time needs to be the current time with arbitrary minutes.
WTF?!
Is the code you give exactly what you tried? Note that the datastore doesn't like range filters that indicate an empty range and returns no results in that case -- so if, e.g., you actually ran something like q.filter('a <', t).filter('a >=', t), that would explain your results.
My bad. I made the assumption that App Engine worked on list properties like MongoDB. If two inequality filters are applied to a list property, one value in the list has to match both. The successful results at the 00 and 30 minute marks were artifacts of using >= and <= where one value was matching both.
Doh.
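To make that concrete, here is a tiny plain-Python illustration (a toy example, not datastore code) of why the chained inequality filters only match when a single list element satisfies both conditions:
import datetime

# Toy illustration of the accepted explanation, not App Engine code.
mon_hours = [datetime.time(9, 0), datetime.time(17, 0)]  # open, close
now_time = datetime.time(10, 30)

# What the two chained filters effectively require of one entity:
matches = any(t <= now_time and t >= now_time for t in mon_hours)  # i.e. t == now_time
print(matches)  # False: no single element equals 10:30 exactly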
This is a query that totals up every player's game results and displays the players who match the conditions.
select *,
(kills / deaths) as killdeathratio,
(totgames - wins) as losses
from (select gp.name as name,
gp.gameid as gameid,
gp.colour as colour,
Avg(dp.courierkills) as courierkills,
Avg(dp.raxkills) as raxkills,
Avg(dp.towerkills) as towerkills,
Avg(dp.assists) as assists,
Avg(dp.creepdenies) as creepdenies,
Avg(dp.creepkills) as creepkills,
Avg(dp.neutralkills) as neutralkills,
Avg(dp.deaths) as deaths,
Avg(dp.kills) as kills,
sc.score as totalscore,
Count(* ) as totgames,
Sum(case
when ((dg.winner = 1 and dp.newcolour < 6) or
(dg.winner = 2 and dp.newcolour > 6))
then 1
else 0
end) as wins
from gameplayers as gp,
dotagames as dg,
games as ga,
dotaplayers as dp,
scores as sc
where dg.winner <> 0
and dp.gameid = gp.gameid
and dg.gameid = dp.gameid
and dp.gameid = ga.id
and gp.gameid = dg.gameid
and gp.colour = dp.colour
and sc.name = gp.name
group by gp.name
having totgames >= 30
) as h
order by totalscore desc
Now I'm not too sure what the best way to go is, but what, in your opinion, would be the best way to optimize this query?
I run a Q6600 @ 2.4 GHz, 4 GB of RAM, 64-bit Linux Ubuntu 9.04, and this query can take up to 6.7 seconds to run (I do have a huge database).
I would also like to paginate the results, and executing extra conditions on top of this query is far too slow....
I use Django as a frontend, so any methods that involve Python and/or Django would be great. MySQL and Apache2 tweaks are also welcome. And of course, I'm open to changing the query to make it run faster.
Thanks for reading my question; look forward to reading your answers!
Edit: EXPLAIN QUERY RESULTS
id select_type table type possible_keys key key_len ref rows Extra
1 PRIMARY <derived2> ALL NULL NULL NULL NULL 783 Using filesort
2 DERIVED sc ALL name,name_2 NULL NULL NULL 2099 Using temporary; Using filesort
2 DERIVED gp ref gameid,colour,name name 17 development.sc.name 2
2 DERIVED ga eq_ref PRIMARY,id,id_2 PRIMARY 4 development.gp.gameid 1 Using index
2 DERIVED dg ref gameid,winner gameid 4 development.ga.id 1 Using where
2 DERIVED dp ref gameid_2,colour gameid_2 4 development.ga.id 10 Using where
First of all, the SQL is badly formatted. The most obvious error is the line splitting before each AS clause. Second obvious problem is using implicit joins instead of explicitly using INNER JOIN ... ON ....
Now to answer the actual question.
Without knowing the data or the environment, the first thing I'd look at would be some of the MySQL server settings, such as sort_buffer and key_buffer. If you haven't changed any of these, go read up on them. The defaults are extremely conservative and can often be raised more than ten times their default, particularly on the large iron like you have.
Having reviewed that, I'd be running pieces of the query to see speed and what EXPLAIN says. The effect of indexing can be profound, but MySQL has a "fingers-and-toes" problem where it just can't use more than one per table. And JOINs with filtering can need two. So it has to descend to a rowscan for the other check. But having said that, dicing up the query and trying different combinations will show you where it starts stumbling.
Now you will have an idea where a "tipping point" might be: this is where a small increase in some raw data size, like how much it needs to extract, will result in a big loss of performance as some internal structure gets too big. At this point, you will probably want to raise the temporary tables size. Beware that this kind of optimization is a bit of a black art. :-)
However, there is another approach: denormalization. In a simple implementation, regularly scheduled scripts will run this expensive query from time-to-time and poke the data into a separate table in a structure much closer to what you want to display. There are multiple variations of this approach. It can be possible to keep this up-to-date on-the-fly, either in the application, or using table triggers. At the other extreme, you could allow your application to run the expensive query occasionally, but cache the result for a little while. This is most effective if a lot of people will call it often: even 2 seconds cache on a request that is run 15 times a second will show a visible improvement.
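Since the asker mentions a Django frontend, here is a hedged sketch of the "cache the result for a little while" variant. STATS_SQL is a hypothetical string holding the big aggregate query above, and the cache key and timeout are arbitrary choices:
from django.core.cache import cache
from django.db import connection

# Hedged sketch of short-lived caching on the Django side.
# STATS_SQL is assumed to contain the expensive aggregate query shown above.
def get_player_stats():
    stats = cache.get('player_stats')
    if stats is None:
        cursor = connection.cursor()
        cursor.execute(STATS_SQL)
        stats = cursor.fetchall()
        cache.set('player_stats', stats, 120)  # reuse the result for two minutes
    return stats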
You could find ways of producing the same data by running half-a-dozen queries that each return some of the data, and post-processing the data. You could also run a version of your original query that returns more data (which is likely to be much faster because it does less filtering) and post-process that. I have found several times that five simpler, smaller queries can be much faster - an order of magnitude, sometimes two - than one big query that is trying to do it all.
No index will help you since you are scanning entire tables.
As your database grows the query will always get slower.
Consider accumulating the stats: after every game, insert the row for that game and also increment counters in the player's row. Then you don't need to count() and sum() because the information is already available (see the sketch below).
select * is bad most times - select only the columns you need
break the select into multiple simple selects, use temporary tables when needed
the sum(case ...) part could be done with a subselect
MySQL has very bad performance with OR expressions; use two selects which you UNION together
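Here is a hedged Django-side sketch of the "accumulate the stats" suggestion above (PlayerStats is a hypothetical denormalized model with one row per player; it is not part of the schema shown in the question):
from django.db.models import F

# Hypothetical denormalized counters, updated whenever a game result is saved.
def record_game_result(player_name, kills, deaths, won):
    PlayerStats.objects.filter(name=player_name).update(
        totgames=F('totgames') + 1,
        kills=F('kills') + kills,
        deaths=F('deaths') + deaths,
        wins=F('wins') + (1 if won else 0),
    )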
Small Improvement
select *,
(kills / deaths) as killdeathratio,
(totgames - wins) as losses
from (select gp.name as name,
gp.gameid as gameid,
gp.colour as colour,
Avg(dp.courierkills) as courierkills,
Avg(dp.raxkills) as raxkills,
Avg(dp.towerkills) as towerkills,
Avg(dp.assists) as assists,
Avg(dp.creepdenies) as creepdenies,
Avg(dp.creepkills) as creepkills,
Avg(dp.neutralkills) as neutralkills,
Avg(dp.deaths) as deaths,
Avg(dp.kills) as kills,
sc.score as totalscore,
Count(1 ) as totgames,
Sum(case
when ((dg.winner = 1 and dp.newcolour < 6) or
(dg.winner = 2 and dp.newcolour > 6))
then 1
else 0
end) as wins
from gameplayers as gp,
( select * from dotagames dg1 where dg1.winner <> 0 ) as dg,
games as ga,
dotaplayers as dp,
scores as sc
where dp.gameid = gp.gameid
and dg.gameid = dp.gameid
and dp.gameid = ga.id
and gp.gameid = dg.gameid
and gp.colour = dp.colour
and sc.name = gp.name
group by gp.name
having totgames >= 30
) as h order by totalscore desc
Changes:
1. count(*) changed to count(1).
2. In the FROM clause, the number of rows is reduced by filtering dotagames in a subquery.