how can i speed up my code? - python

It's a program that suggests to the user a player's name if the user made a typo. It's extremely slow.
First it has to issue a get request, then checks to see if the player's name is within the json data, if it is, pass. Else, it takes all the players' first and last names and appends it to names. Then it checks whether the first_name and last_name closely resembles the names in the list using get_close_matches. I knew from the start this would be very slow, but there has to be a faster way to do this, it's just I couldn't come up with one. Any suggestions?
from difflib import get_close_matches
def suggestion(first_name, last_name):
    names = []
    my_request = get_request("https://www.mysportsfeeds.com/api/feed/pull/nfl/2016-2017-regular/active_players.json")
    for n in my_request['activeplayers']['playerentry']:
        if last_name == n['player']['LastName'] and first_name == n['player']['FirstName']:
            pass
        else:
            names.append(n['player']['FirstName'] + " " + n['player']['LastName'])
    suggest = get_close_matches(first_name + " " + last_name, names)
    return "did you mean " + "".join(suggest) + "?"
print suggestion("mattthews ", "stafffford") #should return Matthew Stafford

Well, since it turned out my suggestion in the comments worked out, I might as well post it as an answer with some other ideas included.
First, take your I/O operation out of the function so that you're not wasting time making the request every time your function is run. Instead, you will get your json and load it into local memory when you start the script. If at all possible, downloading the json data beforehand and instead opening a text file might be a faster option.
Second, you should get a set of unique candidates per loop because there is no need to compare them multiple times. When a name is discarded by get_close_matches(), we know that same name does not need to be compared again. (It would be a different story if the criteria with which the name is being discarded depends on the subsequent names, but I doubt that's the case here.)
Third, try to work with batches. Given that get_close_matches() is reasonably efficient, comparing to, say, 10 candidates at once shouldn't be any slower than to 1. But reducing the for loop from going over 1 million elements to over 100K elements is quite a significant boost.
Fourth, I assume that you're checking for last_name == ['LastName'] and first_name == ['FirstName'] because in that case there would have been no typo. So why not simply break out of the function?
Putting them all together, I can write code that looks like this:
from difflib import get_close_matches
# I/O operation ONCE when the script is run
my_request = get_request("https://www.mysportsfeeds.com/api/feed/pull/nfl/2016-2017-regular/active_players.json")
# Creating batches of 10 names; this also happens only once
# As a result, the script might take longer to load but run faster.
# I'm sure there is a better way to create batches, but I don't know any.
batch = [] # This will contain 10 names.
names = [] # This will contain the batches.
for player in my_request['activeplayers']['playerentry']:
    # Each entry nests the name fields under the 'player' key, as in the question.
    name = player['player']['FirstName'] + " " + player['player']['LastName']
    batch.append(name)
    if len(batch) == 10:
        names.append(batch)
        batch = []
# Keep the last, shorter batch if the number of names is not a multiple of 10.
if batch:
    names.append(batch)
def suggest(first_name, last_name, names):
    desired_name = first_name + " " + last_name
    suggestions = []
    for batch in names:
        # Just return the name if there is no typo
        # Alternatively, you can create a flat list of names outside of the function
        # and see if the desired_name is in the list of names to immediately
        # terminate the function. But I'm not sure which method is faster. It's
        # a quick profiling task for you, though.
        if desired_name in batch:
            return desired_name
        # This way, we only match with new candidates, 10 at a time.
        best_matches = get_close_matches(desired_name, batch)
        suggestions.append(best_matches)
    # We need to flatten the list of suggestions to print.
    # Alternatively, you could use a for loop to append in the first place.
    suggestions = [name for batch in suggestions for name in batch]
    return "did you mean " + ", ".join(suggestions) + "?"
print(suggest("mattthews ", "stafffford", names)) #should suggest Matthew Stafford
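As a further variation on the ideas above, here is a minimal Python 3 sketch that does the exact-match check with a dictionary lookup and then calls get_close_matches() once over all names instead of in batches. PLAYER_NAMES, LOOKUP and suggest_player are hypothetical names standing in for the data loaded from the feed, and lowercasing before comparing is an added assumption that is not in the original code.

```python
from difflib import get_close_matches

# Hypothetical stand-in for the names loaded once from the JSON feed.
PLAYER_NAMES = ["Matthew Stafford", "Matt Ryan", "Aaron Rodgers", "Tom Brady"]

# Map lowercased names back to their original spelling (assumption:
# matching should ignore case). Dict lookups are constant time on average.
LOOKUP = {name.lower(): name for name in PLAYER_NAMES}

def suggest_player(first_name, last_name, lookup=LOOKUP):
    """Return the exact name if it exists, otherwise a 'did you mean' string."""
    desired = (first_name.strip() + " " + last_name.strip()).lower()
    if desired in lookup:
        return lookup[desired]  # no typo, nothing to suggest
    matches = get_close_matches(desired, list(lookup), n=3, cutoff=0.6)
    if not matches:
        return "no close match found"
    return "did you mean " + ", ".join(lookup[m] for m in matches) + "?"

print(suggest_player("mattthews ", "stafffford"))
```

Whether this is faster than batching is worth profiling on the real data; the point of the sketch is only that the request and the lookup table are built once, outside the function.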

Running the same program until condition satisfied

I am trying to create a small program that searches a folder of images, chooses one, checks its size and finishes if the chosen image is at least 5KB. If it is not then I need it to loop back to the choosing step (and then the size check, and so on..)
I am using functions for the choosing and the size-check but when I try to use them in a while loop I get all sorts of indentation errors and now I'm very confused. I've commented the section where I was using the function, but really I guess I want the whole thing to loop back, to the comment at the top..
Here's my code -
#CHOOSE POINT
def chosen():
    random.choice(os.listdir(r"/Users/me/p1/images"))
def size():
    os.path.getsize(r"/Users/me/p1/images/"+chosen)
thresh = 5000
while size < thresh:
    print(chosen + " is too small")
    # loop back to CHOOSE POINT
else:
    print(chosen + " is at least 5KB")
Am I thinking about this all wrong? Will using the function in my while-loop do what I want? What's the best way to achieve what I'm trying to do? I'm quite new to this and getting very confused.
The first thing to realise is that code like this:
def chosen():
    random.choice(os.listdir(r"/Users/me/p1/images"))
is only the definition of a function. It only runs each time you actually call it, with chosen().
Secondly, random.choice() will make a random choice from the list provided (although it's fairly inefficient to keep reading that from disk every time you call it, and it's unclear why you'd pick one at random, but that's OK), but since you don't actually return the value, the function isn't very useful. A choice is made and then discarded. Instead you probably wanted:
def chosen():
    return random.choice(os.listdir(r"/Users/me/p1/images"))
Thirdly, this function definition:
def size():
    os.path.getsize(r"/Users/me/p1/images/"+chosen)
It tries to use chosen, but that's just the name of a function you previously defined. You probably want to get the size of an actual file that was chosen, which the function needs to be given as a parameter (and, as before, it has to return the result):
def size(fn):
    return os.path.getsize(r"/Users/me/p1/images/"+fn)
Now to use those functions:
file_size = 0
threshold = 5000
while file_size < threshold:
    a_file = chosen()
    file_size = size(a_file)
    if file_size < threshold:
        print(a_file + " is too small")
    else:
        print(a_file + " is at least 5KB")
print('Done')
The variable file_size is initialised to 0, to make sure the loop starts. The loop keeps going as long as the condition at the start holds, that is, until a file of at least threshold bytes has been chosen.
Every time chosen() is executed, the returned value is remembered in the variable a_file, which you can then use in the rest of the code to refer back to it.
It then gets passed to size() to obtain a size, and finally the test is performed to print the right message.
A more efficient way to achieve the same:
threshold = 5000
while True:
    a_file = chosen()
    file_size = size(a_file)
    if file_size < threshold:
        print(a_file + " is too small")
    else:
        break
print(a_file + " is at least 5KB")
The break just exits the while loop, which would otherwise keep going forever since it tests for True. This avoids testing the same thing twice.
So, you'd end up with:
import random
import os
def chosen():
    return random.choice(os.listdir(r"/Users/me/p1/images/"))
def size(fn):
    return os.path.getsize(r"/Users/me/p1/images/"+fn)
threshold = 5000
while True:
    a_file = chosen()
    file_size = size(a_file)
    if file_size < threshold:
        print(a_file + " is too small")
    else:
        break
print(a_file + " is at least 5KB")
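One caveat with the loop above is that it never ends if no file in the folder is large enough, and it re-reads the directory on every attempt. A possible variation, sketched here under the assumption that reading the folder once is acceptable, shuffles the listing and returns the first file that passes the size check. The name pick_large_file and its parameters are hypothetical, not part of the original answer.

```python
import os
import random

def pick_large_file(folder, threshold=5000):
    """Return a random file name of at least `threshold` bytes, or None."""
    candidates = os.listdir(folder)  # read the directory only once
    random.shuffle(candidates)       # random order, each file tried at most once
    for name in candidates:
        path = os.path.join(folder, name)
        if os.path.isfile(path) and os.path.getsize(path) >= threshold:
            return name
    return None                      # nothing in the folder is big enough
```

It could be called as pick_large_file(r"/Users/me/p1/images"), with a None result meaning that no image met the threshold.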

How can i optimize my solution not to exceed time limit on my task?

I wrote a program that receives name from user checks if it is already taken in database and if it's not prints "OK". If name is taken program must make new name using old name + number. I keep getting "time limit exceed" error but i don't know what's wrong. I am new to programming so do not judge me strictly.
Here is my code:
n = int(input())
names = []
def CheckDB(name):
    for i in names:
        if i == name:
            return(True)
    return(False)
def MakeNewName(name, number):
    while CheckDB(name+str(number)):
        number+=1
    newName = name+str(number)
    names.append(newName)
    return(newName)
def CreateNewUser(name):
    if CheckDB(name):
        return(MakeNewName(name, 1))
    names.append(name)
    return("OK")
for i in range (n):
    name = input()
    print(CreateNewUser(name))
Input looks like this:
100000
hgtyyvplfrlcr
dcvexvhgtyyvplfrlcryws
hmidcvexvhgtyyvplfrlcryw
vexvhgtyyv
idcvexvhgtyyv
vhgt
midcvexvhgtyyvplfrlcry
yv
lfrl
gtyyvplfrlcryw
xvhgtyyvplfrlcryws
yv
midcvexvhgtyyvplfrlcry
hmidcve
vexvhgtyyv
dcvexvhgtyy
midcvexvhgty
id
xvhgtyyvpl
midcvexvhgtyyvplfrlc
idcvexvhgtyyvplfr
idcvexvhgtyyvplfrl
dcvexvhgtyyv
midcv
midcvexvhgt
idcvexvhgtyyvplfrlcr
midcvexvhgtyy
yvplfrlcryw
midcvexv
l
dcvexvhgtyy
dcv
midcvexvhgtyyvplfrlc
vexvhgtyyvplfrlcry
yvpl
hmidcvexvhgtyyvplfr
And so on
p.s. sorry for my bad English
Python has no built-in 'time limit exceeded' error and your code doesn't show any time limit or what is being timed, so it's hard to say exactly what is going on, but note that time taken by naive linear search of a list grows linearly with the length of the list, and you're doing it many times for names already taken. If you instead use a set to store names, you can check if a name is taken using name in names and this will always take a constant amount of time no matter how large your 'database' grows.
Keep in mind this wouldn't matter if you were using an actual database, as the underlying database engine would handle efficiently indexing primary key columns for you.
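To make the set idea concrete, here is a minimal sketch that keeps the question's behaviour (return "OK" for a new name, otherwise the name plus the first free number) but stores names in a set. Remembering the next number to try per name is an extra assumption added here to avoid recounting from 1 each time; make_registry and create_new_user are hypothetical names.

```python
def make_registry():
    """Return a function that registers names, mimicking CreateNewUser."""
    taken = set()      # all names handed out so far; membership test is O(1) on average
    next_suffix = {}   # for each base name, the next number worth trying

    def create_new_user(name):
        if name not in taken:
            taken.add(name)
            return "OK"
        number = next_suffix.get(name, 1)
        # Still loop, because e.g. "a1" may have been registered directly.
        while name + str(number) in taken:
            number += 1
        new_name = name + str(number)
        taken.add(new_name)
        next_suffix[name] = number + 1
        return new_name

    return create_new_user

register = make_registry()
for requested in ["abacaba", "acaba", "abacaba", "acab"]:
    print(register(requested))
```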

How can I convert a result into a list of variables that I can use as an input?

I was able to come up with these two parts, but I'm having trouble linking them.
Part 1 - This accepts a filter which is listed as 'project = status = blocked'. This will list all issue codes that match the filter and separate them line by line. Is it necessary to convert the results into a list? I'm also wondering if it converts the entire result into one massive string or if each line is a string.
issues_in_project = jira.search_issues(
'project = status = Blocked'
)
issueList = list(issues_in_project)
search_results = '\n'.join(map(str, issueList))
print(search_results)
Part 2 - Right now, the jira.issue will only accept an issue code one at a time. I would like to use the list generated from Part 1 to keep running the code below for each and every issue code in the result. I'm having trouble linking these two parts.
issue = jira.issue(##Issue Code goes here##)
print(issue.fields.project.name)
print(issue.fields.summary + " - " + issue.fields.status.statusCategory.name)
print("Description: " + issue.fields.description)
print("Reporter: " + issue.fields.reporter.displayName)
print("Created on: " + issue.fields.created)
Part 1
'project = status = Blocked' is not valid JQL. So first of all, you will not get a valid result from calling jira.search_issues('project = status = Blocked').
The result of jira.search_issues() is basically a list of jira.resources.Issue objects, not a list of strings and not one big string. To be precise, the result of jira.search_issues() is of type jira.client.ResultList, which is a subclass of Python's list.
Part 2
You already have all the required data in issues_in_project if your JQL is correct. Therefore, you can loop through the list and use the relevant information of each JIRA issue. For your information, jira.issue() returns exactly one jira.resources.Issue object (if the issue key exists).
Example
... # initialize jira
issues_in_project = jira.search_issues('status = Blocked')
for issue in issues_in_project:
    print(issue.key)
    print(issue.fields.summary)
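To link the two parts from the question, the per-issue printing can be wrapped in a function and applied to every element of the search result. The sketch below uses only the attributes that appear in the question's Part 2; describe_issue and describe_all are hypothetical helper names, and treating a missing description as an empty string is an assumption.

```python
def describe_issue(issue):
    """Build the text from Part 2 for one jira.resources.Issue-like object."""
    fields = issue.fields
    return "\n".join([
        fields.project.name,
        fields.summary + " - " + fields.status.statusCategory.name,
        # Assumption: the description may be empty (None), so fall back to "".
        "Description: " + (fields.description or ""),
        "Reporter: " + fields.reporter.displayName,
        "Created on: " + fields.created,
    ])

def describe_all(issues):
    """Apply describe_issue to every issue in a search result."""
    return [describe_issue(issue) for issue in issues]
```

With an initialized client this would be used as: for text in describe_all(jira.search_issues('status = Blocked')): print(text). There is no need to call jira.issue() again for each key, because the search result already contains the issue objects.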

Python Printing on the same Line

The problem that I have is printing phone_sorter() and number_calls() all on the same lines. For instance it will print the two lines of phone_sorter but the number_calls will be printed right below it. I have tried the end='' method but it does not seem to work.
customers=open('customers.txt','r')
calls=open('calls.txt.','r')
def main():
    print("+--------------+------------------------------+---+---------+--------+")
    print("| Phone number | Name | # |Duration | Due |")
    print("+--------------+------------------------------+---+---------+--------+")
    print(phone_sorter(), number_calls())
def time(x):
    m, s = divmod(seconds, x)
    h, m = divmod(m, x)
    return "%d:%02d:%02d" % (h, m, s)
def phone_sorter():
    sorted_no={}
    for line in customers:
        rows=line.split(";")
        sorted_no[rows[1]]=rows[0]
    for value in sorted(sorted_no.values()):
        for key in sorted_no.keys():
            if sorted_no[key] == value:
                print(sorted_no[key],key)
def number_calls():
    no_calls={}
    for line in calls:
        rows=line.split(";")
        if rows[1] not in no_calls:
            no_calls[rows[1]]=1
        else:
            no_calls[rows[1]]+=1
    s={}
    s=sorted(no_calls.keys())
    for key in s:
        print(no_calls[key])
main()
Your key problem is that both phone_sorter and number_calls do their own printing, and return None. So, printing their return values is absurd and should just end with a None None line that makes no sense, after they've done all their own separate-line printing.
A better approach is to restructure them to return, not print, the strings they determine, and only then arrange to print those strings with proper formatting in the "orchestrating" main function.
It looks like they'll each return a list of strings (which they are now printing on separate lines) and you'll likely want to zip those lists if they are in corresponding order, to prepare the printing.
But your code is somewhat opaque, so it's hard to tell if the orders of the two are indeed corresponding. They'd better be, if the final printing is to make sense...
Added: let me exemplify with some slight improvement and one big change in phone_sorter...:
def phone_sorter():
    sorted_no={}
    for line in customers:
        rows=line.split(";")
        sorted_no[rows[1]]=rows[0]
    sorted_keys = sorted(sorted_no, key=sorted_no.get)
    results = [(sorted_no[k], k) for k in sorted_keys]
    return results
Got it? Apart from doing the computations better, the core idea is to put together a list and return it -- it's main's job to format and print it appropriately, in concert with a similar list returned by number_calls (which appears to be parallel).
import collections
def number_calls():
    no_calls=collections.Counter(
        line.split(';')[1] for line in calls)
    return [no_calls[k] for k in sorted(no_calls)]
Now the relationship between the two lists is not obvious to me, but, assuming they're parallel, main can do e.g:
nc = number_calls()
ps = phone_sorter()
for (duration, name), numcalls in zip(ps, nc):
    print(...however you want to format the fields here...)
Those headers you printed in main don't tell me what data should be printed under each, and how the printing should be formatted (width of each field, for example). But main, and only main, should be intimately familiar with these presentation issues and control them, while the other functions deal with the "business logic" of extracting the data appropriately. "Separation of concerns" -- a big issue in programming!
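As an illustration of keeping the formatting in one place, here is a minimal sketch of what the printing side could look like. It assumes the two lists are parallel, that each tuple is (phone, name), and that the column widths roughly match the header; all three are assumptions, and format_rows plus the sample data are hypothetical.

```python
def format_rows(phone_rows, call_counts):
    """Pair up the two parallel lists and format one table line per customer."""
    lines = []
    for (phone, name), count in zip(phone_rows, call_counts):
        # Column widths are assumptions; adjust them to match the header.
        lines.append("| {:<12} | {:<28} | {:>1} |".format(phone, name, count))
    return lines

# Hypothetical sample data standing in for phone_sorter() and number_calls().
sample_rows = [("555-0101", "Ann Lee"), ("555-0199", "Bob Ray")]
sample_counts = [3, 1]
for line in format_rows(sample_rows, sample_counts):
    print(line)
```

Because each row is built as one string before printing, every customer ends up on a single line, which is what the end='' attempt was trying to achieve.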

python loop optimzation - iterate dirs 3 levels and delete

Hi I have the following procedure,
Questions:
- How to make it elegant, more readable, compact.
- What can I do to extract common loops to another method.
Assumptions:
From a given rootDir the dirs are organized as in ex below.
What the proc does:
If input is 200, it deletes all DIRS that are OLDER than 200 days. NOT based on modifytime, but based on dir structure and dir name [I will later delete by brute force "rm -Rf" on each dir that are older]
e.g dir structure:
-2009(year dirs) [will force delete dirs e.g "rm -Rf" later]
-2010
-01...(month dirs)
-05 ..
-01.. (day dirs)
-many files. [I won't check mtime at file level - takes more time]
-31
-12
-2011
-2012 ...
Code that I have:
def get_dirs_to_remove(dir_path, olderThanDays):
    today = datetime.datetime.now();
    oldestDayToKeep = today + datetime.timedelta(days= -olderThanDays)
    oldKeepYear = int(oldestDayToKeep.year)
    oldKeepMonth =int(oldestDayToKeep.month);
    oldKeepDay = int(oldestDayToKeep.day);
    for yearDir in os.listdir(dirRoot):
        #iterate year dir
        yrPath = os.path.join(dirRoot, yearDir);
        if(is_int(yearDir) == False):
            problemList.append(yrPath); # can't convert year to an int, store and report later
            continue
        if(int(yearDir) < oldKeepYear):
            print "old Yr dir: " + yrPath
            #deleteList.append(yrPath); # to be bruteforce deleted e.g "rm -Rf"
            yield yrPath;
            continue
        elif(int(yearDir) == oldKeepYear):
            # iterate month dir
            print "process Yr dir: " + yrPath
            for monthDir in os.listdir(yrPath):
                monthPath = os.path.join(yrPath, monthDir)
                if(is_int(monthDir) == False):
                    problemList.append(monthPath);
                    continue
                if(int(monthDir) < oldKeepMonth):
                    print "old month dir: " + monthPath
                    #deleteList.append(monthPath);
                    yield monthPath;
                    continue
                elif (int(monthDir) == oldKeepMonth):
                    # iterate Day dir
                    print "process Month dir: " + monthPath
                    for dayDir in os.listdir(monthPath):
                        dayPath = os.path.join(monthPath, dayDir)
                        if(is_int(dayDir) == False):
                            problemList.append(dayPath);
                            continue
                        if(int(dayDir) < oldKeepDay):
                            print "old day dir: " + dayPath
                            #deleteList.append(dayPath);
                            yield dayPath
                            continue
print [ x for x in get_dirs_to_remove(dirRoot, olderThanDays)]
print "probList" % problemList # how can I get this list also from the same proc?
This actually looks pretty nice, except for the one big thing mentioned in this comment:
print "probList" % problemList # how can I get this list also from the same proc?
It sounds like you're storing problemList in a global variable or something, and you'd like to fix that. Here are a few ways to do this:
- Yield both delete files and problem files—e.g., yield a tuple where the first member says which kind it is, and the second what to do with it.
- Take the problemList as a parameter. Remember that lists are mutable, so appending to the argument will be visible to the caller.
- Yield the problemList at the end—which means you need to restructure the way you use the generator, because it's no longer just a simple iterator.
- Code the generator as a class instead of a function, and store problemList as a member variable.
- Peek at the internal generator information and cram problemList in there, so the caller can retrieve it.
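The first of these options can be sketched as follows. This is a simplified illustration on a plain list of directory names rather than the real three-level walk; classify_dirs, the "delete"/"problem" tags, and the use of str.isdigit() in place of the question's is_int helper are all assumptions made for the example.

```python
def classify_dirs(names, keep_value):
    """Yield ("problem", name) or ("delete", name) for each relevant name."""
    for name in names:
        if not name.isdigit():          # stand-in for the question's is_int()
            yield ("problem", name)
        elif int(name) < keep_value:
            yield ("delete", name)
        # names equal to or newer than keep_value are simply skipped here

to_delete, problems = [], []
for kind, name in classify_dirs(["2009", "2010", "tmp", "2012"], 2011):
    if kind == "delete":
        to_delete.append(name)
    else:
        problems.append(name)
print(to_delete, problems)
```

The caller then decides what to do with each kind, and no global list is needed.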
Meanwhile, there are a few ways you could make the code more compact and readable.
Most trivially:
print [ x for x in get_dirs_to_remove(dirRoot, olderThanDays)]
This list comprehension is exactly the same as the original iteration, which you can write more simply as:
print list(get_dirs_to_remove(dirRoot, olderThanDays))
As for the algorithm itself, you could partition the listdir, and then just use the partitioned lists. You could do it lazily:
yearDirs = os.listdir(dirRoot)
problemList.extend(yearDir for yearDir in yearDirs if not is_int(yearDir))
yield from (yearDir for yearDir in yearDirs if is_int(yearDir) and int(yearDir) < oldKeepYear)
for year in (yearDir for yearDir in yearDirs if is_int(yearDir) and int(yearDir) == oldKeepYear):
    # next level down
Or strictly:
yearDirs = os.listdir(dirRoot)
problems, older, eq, newer = partitionDirs(yearDirs, oldKeepYear)
problemList.extend(problems)
yield from older
for year in eq:
    # next level down
The latter probably makes more sense, especially given that yearDirs is already a list, and isn't likely to be that big anyway.
Of course you need to write that partitionDirs function—but the nice thing is, you get to use it again in the months and days levels. And it's pretty simple. In fact, I might actually do the partitioning by sorting, because it makes the logic so obvious, even if it's more verbose:
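def partitionDirs(dirs, keyvalue):
    problems = [dir for dir in dirs if not is_int(dir)]
    # The generator expression needs its own parentheses when key= is also passed.
    values = sorted((dir for dir in dirs if is_int(dir)), key=int)
    older, eq, newer = partitionSortedListAt(values, keyvalue, key=int)
    return problems, older, eq, newer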
If you look around (maybe search "python partition sorted list"?), you can find lots of ways to implement the partitionSortedListAt function, but here's a sketch of something that I think is easy to understand for someone who hasn't thought of the problem this way:
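def partitionSortedListAt(vals, keyvalue, key=int):
    # vals must already be sorted by key; compare on the converted keys.
    keys = [key(v) for v in vals]
    lo = bisect.bisect_left(keys, keyvalue)   # first index with key >= keyvalue
    hi = bisect.bisect_right(keys, keyvalue)  # first index with key > keyvalue
    return vals[:lo], vals[lo:hi], vals[hi:]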
If you search for "python split predicate" you can also find other ways to implement the initial split—although keep in mind that most people are either concerned with being able to partition arbitrary iterables (which you don't need here), or, rightly or not, worried about efficiency (which you don't care about here either). So, don't look for the answer that someone says is "best"; look at all of the answers, and pick the one that seems most readable to you.
Finally, you may notice that you end up with three levels that look almost identical:
yearDirs = os.listdir(dirRoot)
problems, older, eq, newer = partitionDirs(yearDirs, oldKeepYear)
problemList.extend(problems)
yield from older
for year in eq:
    monthDirs = os.listdir(os.path.join(dirRoot, str(year)))
    problems, older, eq, newer = partitionDirs(monthDirs, oldKeepMonth)
    problemList.extend(problems)
    yield from older
    for month in eq:
        dayDirs = os.listdir(os.path.join(dirRoot, str(year), str(month)))
        problems, older, eq, newer = partitionDirs(dayDirs, oldKeepDay)
        problemList.extend(problems)
        yield from older
        yield from eq
You can simplify this further through recursion—pass down the path so far, and the list of further levels to check, and you can turn this 18 lines into 9. Whether that's more readable or not depends on how well you manage to encode the information to pass down and the appropriate yield from. Here's a sketch of the idea:
def doLevel(pathSoFar, dateComponentsLeft):
    if not dateComponentsLeft:
        return
    dirs = os.listdir(pathSoFar)
    problems, older, eq, newer = partitionDirs(dirs, dateComponentsLeft[0])
    problemList.extend(problems)
    yield from older
    if eq:
        yield from doLevel(os.path.join(pathSoFar, eq[0]), dateComponentsLeft[1:])
yield from doLevel(rootPath, [oldKeepYear, oldKeepMonth, oldKeepDay])
If you're on an older Python version that doesn't have yield from, the earlier stuff is almost trivial to transform; the recursive version as written will be uglier and more painful. But there's really no way to avoid this when dealing with recursive generators, because a sub-generator cannot "yield through" a calling generator.
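For reference, here is a self-contained Python 3 sketch that combines the partitioning and the recursion into something runnable. It is one possible reading of the approach, not the answer's exact code: it yields full paths rather than bare names, takes the problem list as a parameter, uses str.isdigit() in place of the question's is_int helper, and does not yield the directory equal to the cutoff day. The names partition_dirs and dirs_to_remove are hypothetical.

```python
import os

def partition_dirs(names, key_value):
    """Split names into (problems, older, equal, newer) relative to key_value."""
    problems = [n for n in names if not n.isdigit()]
    numeric = [n for n in names if n.isdigit()]
    older = [n for n in numeric if int(n) < key_value]
    equal = [n for n in numeric if int(n) == key_value]
    newer = [n for n in numeric if int(n) > key_value]
    return problems, older, equal, newer

def dirs_to_remove(path, components, problems):
    """Yield paths older than the cutoff given as [year, month, day]."""
    if not components:
        return
    names = sorted(os.listdir(path))
    bad, older, equal, _newer = partition_dirs(names, components[0])
    # Record unparseable names on the caller's list instead of a global.
    problems.extend(os.path.join(path, n) for n in bad)
    for name in older:
        yield os.path.join(path, name)
    for name in equal:
        # Only the directory matching the cutoff needs a closer look.
        yield from dirs_to_remove(os.path.join(path, name), components[1:], problems)
```

A caller would create an empty list, run list(dirs_to_remove(root, [year, month, day], problems)), and afterwards find the unparseable paths in problems.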
I would suggest not using generators unless you are absolutely sure you need them. In this case, you don't need them.
In the below, newer_list isn't strictly needed. While categorizeSubdirs could be made recursive, I don't feel that the increase in complexity is worth the repetition savings (but that's just a personal style issue; I only use recursion when it's unclear how many levels of recursion are needed or the number is fixed but large; three isn't enough IMO).
import datetime
import os
def categorizeSubdirs(keep_int, base_path):
    older_list = []
    equal_list = []
    newer_list = []
    problem_list = []
    for subdir_str in os.listdir(base_path):
        subdir_path = os.path.join(base_path, subdir_str)
        try:
            # Convert the directory name, not the full path.
            subdir_int = int(subdir_str)
        except ValueError:
            problem_list.append(subdir_path)
        else:
            if subdir_int < keep_int:
                older_list.append(subdir_path)
            elif subdir_int > keep_int:
                newer_list.append(subdir_path)
            else:
                equal_list.append(subdir_path)
    # Note that for your case, you don't need newer_list,
    # and it's not clear if you need problem_list
    return older_list, equal_list, newer_list, problem_list
def get_dirs_to_remove(dir_path, olderThanDays):
    oldest_dt = datetime.datetime.now() + datetime.timedelta(days= -olderThanDays)
    remove_list = []
    problem_list = []
    olderYear_list, equalYear_list, newerYear_list, problemYear_list = categorizeSubdirs(oldest_dt.year, dir_path)
    remove_list.extend(olderYear_list)
    problem_list.extend(problemYear_list)
    for equalYear_path in equalYear_list:
        olderMonth_list, equalMonth_list, newerMonth_list, problemMonth_list = categorizeSubdirs(oldest_dt.month, equalYear_path)
        remove_list.extend(olderMonth_list)
        problem_list.extend(problemMonth_list)
        for equalMonth_path in equalMonth_list:
            olderDay_list, equalDay_list, newerDay_list, problemDay_list = categorizeSubdirs(oldest_dt.day, equalMonth_path)
            remove_list.extend(olderDay_list)
            problem_list.extend(problemDay_list)
    return remove_list, problem_list
The three nested loops at the end could be made less repetitive at the cost of code complexity. I don't think that it's worth it, though reasonable people can disagree. All else being equal, I prefer simpler code to slightly more clever code; as they say, reading code is harder than writing it, so if you write the most clever code you can, you're not going to be clever enough to read it. :/
