can this python be shorter - python

I tend to obsess about expressing code the most compactly and succinctly possible without sacrificing runtime efficiency.
Here's my code:
p_audio = plate.parts.filter(content__iendswith=".mp3")
p_video = not p_audio and plate.parts.filter(content__iendswith=".flv")
p_swf = not p_audio and not p_video and plate.parts.filter(content__iendswith=".swf")
extra_context.update({
    'p_audio': p_audio and p_audio[0],
    'p_video': p_video and p_video[0],
    'p_swf': p_swf and p_swf[0]
})
Are there any python/django gurus that can drastically shorten this code?

Actually, in your pursuit of compactness and efficiency, you have managed to come up with code that is terribly inefficient. When you test p_audio or not p_audio, the queryset is evaluated, and because you haven't sliced it before then, the entire filter result is fetched from the database: every part whose content ends with .mp3, and so on.
You should ensure you do the slice for each query first, before you refer to the value of that query. Since you're concerned with code compactness, you probably want to slice with [:1] first, to get a queryset of a single object:
p_audio = plate.parts.filter(content__iendswith=".mp3")[:1]
p_video = not p_audio and plate.parts.filter(content__iendswith=".flv")[:1]
p_swf = not p_audio and not p_video and plate.parts.filter(content__iendswith=".swf")[:1]
and the rest can stay the same.
Edit to add: this matters because you're only interested in the first element of each list, as evidenced by the fact that you only pass [0] from each element into the context. But in your code, not p_audio refers to the original, unsliced queryset, and to determine the true/false value of the queryset, Django has to evaluate it, which fetches all matching rows from the database and converts them into Python objects. Since you don't actually want those objects, you're doing a lot more work than you need.
Note though that it's not re-running it every time: just the first time, since after the first evaluation the queryset is cached internally. But as I say, that's already more work than you want.

Besides featuring less redundancy, this is also way easier to extend with new content types.
kinds = (("p_audio", ".mp3"), ("p_video", ".flv"), ("p_swf", ".swf"))
extra_context.update((key, False) for key, _ in kinds)
for key, ext in kinds:
    entries = plate.parts.filter(content__iendswith=ext)
    if entries:
        extra_context[key] = entries[0]
        break

Just adding this as another answer inspired by Pyroscope's above (as my edit there has to be peer reviewed)
The latest incarnation exploits the fact that the Django template system just disregards nonexistent context items when they are referenced, so mp3, etc. below do not need to be initialized to False (or 0). So, the following meets all the functionality of the code from the OP. The other optimization is that mp3, etc. are used as key names (instead of "p_audio" etc.)
for key in ['mp3','flv','swf'] :
    entries = plate.parts.filter(content__iendswith=key)[:1]
    extra_context[key] = entries and entries[0]
    if extra_context[key] :
        break

Related

More efficient way to slice list / remove processed data in Python for loops

Processing a number of 'leads', which are all in the form of class objects and POSTing them via API to third party platform. Script works, but is slow and inefficient. Looking for ideas on how to speed it up.
ADMINS = admins.get_admins()
lead_list_ids = get_lead_list_ids(TAG) # returns dict of admin.slug / list id pairs
processed = []
for admin in tqdm(ADMINS):
    lead_list_id = lead_list_ids[admin.slug]
    for lead in tqdm(hunter_results):
        if lead.account.owner.email.split('#')[0] == admin.slug: # splitting email to get user "initials" which is same as admin.slug
            processed.append(lead)
            create_lead(lead, lead_list_id)
    # creates a slice modifying existing array...might be taking more time than it saves..
    hunter_results[:] = [lead for lead in hunter_results if lead not in processed]
print(f'\nSuccess! {len(hunter_results)} leads created.')
This currently runs very slow...I originally wrote it without the 'processed' array, which caused the script to iterate over the 'hunter_results' array (3000+ items) again and again for every single Admin (user). This seemed inefficient so I decided to remove the processed leads by appending them to 'processed' list, and then filtering the original array down. To my (somewhat) surprise, this takes even longer as the slice/list comprehension is hella slow at filtering down the list.
I assume this is because the list comprehension essentially created another loop that needs to run, but I am struggling to come up with a more efficient way to do this. I do not want to remove the values from the original array during iteration for obvious reasons, but it seems doing this as a separate process is even worse. Any ideas?
I think a strategy that iterates over leads is what you want. Does this do what you seek to do? I don't think there would be a need to mutate hunter_results then as we just work through them one at a time looking for admins.
admin_slugs = set(admin.slug for admin in admins.get_admins())
lead_list_ids = get_lead_list_ids(TAG)
for lead in hunter_results:
    admin_slug = lead.account.owner.email.split('#')[0]
    if admin_slug not in admin_slugs:
        continue
    create_lead(lead, lead_list_ids[admin_slug])
You could try something like this:
ADMINS = admins.get_admins()
lead_list_ids = get_lead_list_ids(TAG) # returns dict of admin.slug / list id pairs
admin_dict = {}
for admin in tqdm(ADMINS):
    admin_dict[admin.slug] = 1
for lead in tqdm(hunter_results):
    initials = lead.account.owner.email.split('#')[0]
    if (admin_dict.get(initials)):
        create_lead(lead, lead_list_ids[initials])
Should be the same logic as your code, but it's lower time complexity.
And you are correct that the list comprehension creates another for loop.
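Both answers above rely on the same idea: make one pass over the leads and use a hash-based lookup for the admin slugs. A minimal, self-contained sketch of that idea follows. The Lead class and the group_leads_by_slug helper are hypothetical stand-ins (the real lead objects and create_lead are not shown in the question), and the '#' separator is copied from the question's code.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Lead:
    """Hypothetical stand-in for the lead objects in the question."""
    owner_email: str
    name: str


def group_leads_by_slug(leads, admin_slugs):
    """Bucket leads by owner slug in a single pass, skipping unknown owners."""
    known = set(admin_slugs)           # set membership tests are O(1) on average
    grouped = defaultdict(list)
    for lead in leads:
        slug = lead.owner_email.split('#')[0]   # same split as in the question
        if slug in known:
            grouped[slug].append(lead)
    return dict(grouped)


leads = [
    Lead("ab#example.com", "first"),
    Lead("cd#example.com", "second"),
    Lead("ab#example.com", "third"),
    Lead("zz#example.com", "no matching admin"),
]
grouped = group_leads_by_slug(leads, ["ab", "cd"])
```

Each lead is inspected once, so the work grows with the number of leads rather than with leads times admins, and nothing has to be removed from the original list.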

Check if Object attribute is present in list of Object

I have an object with different attributes and a list that contains those objects.
Before adding an object to the list, I'd like to check if an attribute of this new object is present in the list.
This attribute is unique, so this is done to make sure that every object in the list is unique.
I would do something like this:
for post in stream:
    if post.post_id not in post_list:
        post_list.append(post)
    else:
        # Find old post in the list and replace it
But obviously line 2 doesn't work as I'm comparing the post_id to the object list.
Keep a separate set to which you add the attribute, and against which you can then test the next value:
ids_seen = set()
for post in stream:
    if post.post_id not in ids_seen:
        post_list.append(post)
        ids_seen.add(post.post_id)
Another option is to create an ordered dict first, with the ids as keys:
posts = OrderedDict((post.post_id, post) for post in stream)
post_list = list(posts.values())
This will keep the most recently seen post reference for a given id, but you'll still have only unique ids.
If ordering isn't important, just use a regular dictionary comprehension:
posts = {post.post_id: post for post in stream}
post_list = list(posts.values())
If you are using Python 3.6 or newer, then the order will be preserved anyway as the CPython implementation was updated to retain input order, and in Python 3.7 this feature became part of the language specification.
Whatever you do, don't use a separate list to test the post.post_id against, as that takes O(N) time each time you check to see if the id is present, where N is the number of items in your stream in the end. Combined with O(N) such checks, that approach would take O(N**2) quadratic time, meaning that for every 10-fold increase in the number of input items, you'd also take 100 times more time to process them all.
But when using a set or dictionary, testing if the id is already there only takes O(1) constant time, so checks are cheap. That makes a full processing loop take O(N) linear time, meaning that it'll take time directly proportional to how many input items you have.
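The "find the old post and replace it" branch from the question can be sketched with a plain dict keyed by post_id. This is only an illustration: Post is a hypothetical namedtuple standing in for the real post objects, and dedupe_keep_latest is an invented helper name. It assumes Python 3.7 or newer, where dicts keep insertion order.

```python
from collections import namedtuple

# Hypothetical post type; only post_id matters for de-duplication.
Post = namedtuple("Post", ["post_id", "text"])


def dedupe_keep_latest(stream):
    """Return posts with unique ids. A later post replaces an earlier one
    with the same id but keeps the position of the first occurrence."""
    posts = {}
    for post in stream:
        posts[post.post_id] = post    # O(1) insert or overwrite
    return list(posts.values())


stream = [Post(1, "a"), Post(2, "b"), Post(1, "a (edited)")]
post_list = dedupe_keep_latest(stream)
```

Overwriting an existing key does not move it, so the replaced post stays where the original one was.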
This should work
for post in stream:
    if post.post_id not in [post.post_id for post in post_list]:
        post_list.append(post)

How to implement a binary search on a list created from a file

This is my first post, please be gentle. I'm attempting to sort some files into ascending and descending order. Once I have sorted a file, I store it in a list which is assigned to a variable. The user is then to choose a file and search for an item. I get an error message:
TypeError: unorderable types: int() < list()
whenever I try to search for an item using the variable of my sorted list. The error occurs on line 27 of my code. From research, I know that an int and a list cannot be compared, but I can't for the life of me think how else to search a large (600-item) list for an item.
At the moment I'm just playing around with binary search to get used to it.
Any suggestions would be appreciated.
year = []
with open("Year_1.txt") as file:
    for line in file:
        line = line.strip()
        year.append(line)
def selectionSort(alist):
    for fillslot in range(len(alist)-1,0,-1):
        positionOfMax=0
        for location in range(1,fillslot+1):
            if alist[location]>alist[positionOfMax]:
                positionOfMax = location
        temp = alist[fillslot]
        alist[fillslot] = alist[positionOfMax]
        alist[positionOfMax] = temp
def binarySearch(alist, item):
    first = 0
    last = len(alist)-1
    found = False
    while first<=last and not found:
        midpoint = (first + last)//2
        if alist[midpoint] == item:
            found = True
        else:
            if item < alist[midpoint]:
                last = midpoint-1
            else:
                first = midpoint+1
    return found
selectionSort(year)
testlist = []
testlist.append(year)
print(binarySearch(testlist, 2014))
Year_1.txt file consists of 600 items, all years in the format of 2016.
They are listed in descending order and start at 2017, down to 2013. Hope that makes sense.
Is there some reason you're not using the Python: bisect module?
Something like:
import bisect
sorted_year = list()
for each in year:
    bisect.insort(sorted_year, each)
... is sufficient to create the sorted list. Then you can search it using functions such as those in the documentation.
(Actually you could just use year.sort() to sort the list in-place ... bisect.insort() might be marginally more efficient for building the list from the input stream in lieu of your call to year.append() ... but my point about using the bisect module remains).
Also note that 600 items is trivial for modern computing platforms. Even 6,000 won't take but a few milliseconds. On my laptop sorting 600,000 random integers takes about 180ms and similar sized strings still takes under 200ms.
So you're probably not gaining anything by sorting this list in this application at that data scale.
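For the search itself, a membership test on top of bisect.bisect_left could look like the sketch below. The contains helper is an invented name, and it assumes the list is sorted ascending and holds values of a single comparable type (ints here).

```python
import bisect


def contains(sorted_items, target):
    """Binary search: True if target is in the ascending list sorted_items."""
    i = bisect.bisect_left(sorted_items, target)   # leftmost insertion point
    return i < len(sorted_items) and sorted_items[i] == target


years = sorted([2017, 2013, 2016, 2014, 2016])
found = contains(years, 2014)
missing = contains(years, 2015)
```

bisect_left only returns an insertion index, so the explicit equality check is what turns it into a yes/no search.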
On the other hand Python also includes a number of modules in its standard libraries for managing structured data and data files. For example you could use Python: SQLite3.
Using this you'd use standard SQL DDL (data definition language) to describe your data structure and schema, SQL DML (data manipulation language: INSERT, UPDATE, and DELETE statements) to manage the contents of the data and SQL queries to fetch data from it. Your data can be returned sorted on any column and any mixture of ascending and descending on any number of columns with the standard SQL ORDER BY clauses and you can add indexes to your schema to ensure that the data is stored in a manner to enable efficient querying and traversal (table scans) in any order on any key(s) you choose.
Because Python includes SQLite in its standard libraries, and because SQLite provides SQL client/server semantics over simple local files ... there's almost no downside to using it for structured data. It's not like you have to install and maintain additional software, servers, handle network connections to a remote database server nor any of that.
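A minimal sketch of that approach with the standard sqlite3 module, using an in-memory database. The table name, column name, index name, and sample values are all made up for illustration.

```python
import sqlite3

# An in-memory database; pass a file path instead to persist the data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE years (year INTEGER NOT NULL)")
conn.execute("CREATE INDEX idx_years_year ON years (year)")
conn.executemany("INSERT INTO years (year) VALUES (?)",
                 [(2017,), (2013,), (2016,), (2014,)])

# Sorted output comes from ORDER BY rather than from hand-written sorting.
descending = [row[0] for row in
              conn.execute("SELECT year FROM years ORDER BY year DESC")]

# Existence check; the index lets SQLite avoid a full table scan.
has_2014 = conn.execute(
    "SELECT EXISTS(SELECT 1 FROM years WHERE year = ?)", (2014,)
).fetchone()[0] == 1
conn.close()
```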
I'm going to walk through some steps before getting to the answer.
You need to post a minimal, complete, and verifiable example. Instead of telling us to read from "Year_1.txt", which we don't have, you need to put the list itself in the code. Do you NEED 600 entries to get the error in your code? No. This is sufficient:
year = ["2001", "2002", "2003"]
If you really need 600 entries, then provide them. Either post the actual data, or
year = [str(x) for x in range(2017-600, 2017)]
The code you post needs to be Cut, Paste, Boom - reproduces the error on my computer just like that.
selectionSort is completely irrelevant to the question, so delete it from the question entirely. In fact, since you say the input was already sorted, I'm not sure what selectionSort is actually supposed to do in your code, either. :)
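Next you say testlist = [] and then testlist.append(year). USE YOUR DEBUGGER before you ask here. Simply looking at the value in your variable would have made the problem obvious: append adds the whole year list as a single element, so testlist is a list containing one list.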
How to append list to second list (concatenate lists)
Fixing that means you now have a list of things to search. Before you were searching a list to see if 2014 matched the one thing in there, which was a complete list of all the years.
Now we get into binarySearch. If you look at the variables, you see you are comparing the integer 2014 with some string, maybe "1716", and the answer to that is useless, if it even lets you do that (I have python 2.7 so I am not sure exactly what you get there). But the point is you can't find the integer 2014 in a list of strings, so it will always return False.
If you don't have a debugger, then you can place strategic print statements like
print ("debug info: binarySearch comparing ", item, alist[midpoint])
Now here, what VBB said in comments worked for me, after I fixed the other problems. If you are searching for something that isn't even in the list, and expecting True, that's wrong. Searching for "2014" returns True, if you provide the correct list to search. Alternatively, you could force it to string and then search for it. You could force all the years to int during the input phase. But the int 2014 is not the same as the string "2014".
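Putting those fixes together, a small sketch under stated assumptions: the lines list is a stand-in for the contents of the file, binary_search is a compact rewrite of the question's function rather than the original, and the years are converted to int while reading.

```python
def binary_search(alist, item):
    """Iterative binary search over an ascending list."""
    first, last = 0, len(alist) - 1
    while first <= last:
        midpoint = (first + last) // 2
        if alist[midpoint] == item:
            return True
        if item < alist[midpoint]:
            last = midpoint - 1
        else:
            first = midpoint + 1
    return False


lines = ["2017", "2016", "2014", "2013"]     # stand-in for the file contents
year = sorted(int(line) for line in lines)   # convert to int, then sort ascending

nested = []
nested.append(year)    # one element: the whole list (the bug in the question)
flat = []
flat.extend(year)      # one element per year

result = binary_search(flat, 2014)
```

Searching nested instead of flat compares the int 2014 with a list, which is exactly the comparison the TypeError in the question complains about.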

Speed up lookup item in list (via Python)

I have a very large list, and I have to run a lot of lookups for this list.
To be more specific, I am processing a large (> 11 GB) text file, but some items appear more than once, and I only have to process each of them the first time it appears.
If the pattern shows up, I process it and put it in a list. If the item appears again, I check for it in the list, and if it is there, I just skip processing, like this:
[...]
if boundary.match(line):
    if closedreg.match(logentry):
        closedthreads.append(threadid)
    elif threadid in closedthreads:
        pass
    else:
        [...]
The code itself is far from optimal. My main problem is that the 'closedthreads' list contains a few million items, and the whole operation just gets slower and slower.
I think it could help to sort the list (or use a 'sorted list' object) after every append(), but I am not sure about this.
What is the most elegant solution?
You can simply use a set or a hash table which marks if given id already appeared. It should solve your problem with O(1) time complexity for adding and finding an item.
Using a set instead of a list will give you O(1) lookup time, although there may be other ways to optimize this that will work better for your particular data.
closedthreads = set()
# ...
if boundary.match(line):
    if closedreg.match(logentry):
        closedthreads.add(threadid)
    elif threadid in closedthreads:
        pass
    else:
        # ...
Do you need to preserve ordering?
If not - use a set.
If you do - use an OrderedDict. OrderedDict lets you store values associated with it as well (example, process results)
But... do you need to preserve the original values at all? You might look at the 'dbm' module if you absolutely do (or buy a lot of memory!) or, instead of storing the actual text, store SHA-1 digests, or something like that. If all you want to do is make sure you don't run the same element twice, that might work.
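The digest idea from the last paragraph might look like the sketch below. The digest helper and the sample thread ids are invented for illustration. Note that a 20-byte SHA-1 digest only saves memory when the stored items are noticeably longer than that.

```python
import hashlib


def digest(text):
    """20-byte SHA-1 digest of a string, used as a compact stand-in key."""
    return hashlib.sha1(text.encode("utf-8")).digest()


seen = set()
processed = []
for threadid in ["t1", "t2", "t1", "t3", "t2"]:
    key = digest(threadid)
    if key in seen:
        continue             # already handled, skip it
    seen.add(key)
    processed.append(threadid)
```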

Python: Set with only existence check?

I have a set of lots of big long strings that I want to do existence lookups for. I don't need the whole string ever to be saved. As far as I can tell, the set() actually stored the string which is eating up a lot of my memory.
Does such a data structure exist?
done = hash_only_set()
while len(queue) > 0 :
    item = queue.pop()
    if item not in done :
        process(item)
        done.add(item)
(My queue is constantly being filled by other threads so I have no way of dedupping it at the start).
It's certainly possible to keep a set of only hashes:
done = set()
while len(queue) > 0 :
    item = queue.pop()
    h = hash(item)
    if h not in done :
        process(item)
        done.add(h)
Notice that because of hash collisions, there is a chance that you consider an item done even though it isn't.
If you cannot accept this risk, you really need to save the full strings to be able to tell whether you have seen it before. Alternatively: perhaps the processing itself would be able to tell?
Yet alternatively: if you cannot accept to keep the strings in memory, keep them in a database, or create files in a directory with the same name as the string.
You can use a data structure called Bloom Filter specifically for this purpose. A Python implementation can be found here.
EDIT: Important notes:
False positives are possible in this data structure, i.e. a check for the existence of a string could return a positive result even though it was not stored.
False negatives (getting a negative result for a string that was stored) are not possible.
That said, the chances of this happening can be brought to a minimum if used properly and so I consider this data structure to be very useful.
If you use a secure hash function (like SHA-256, found in the hashlib module) to hash the strings, it's very unlikely that you will find a duplicate (and if you do, you can probably win a prize, as with most cryptographic hash functions).
The builtin __hash__() method does not guarantee you won't have duplicates (and since its result is only one machine word wide, 32 bits on 32-bit builds, collisions are far more likely than with a cryptographic hash).
You need to know the whole string to have 100% certainty. If you have lots of strings with similar prefixes you could save space by using a trie to store the strings. If your strings are long you could also save space by using a large hash function like SHA-1 to make the possibility of hash collisions so remote as to be irrelevant.
If you can make the process() function idempotent - i.e. having it called twice on an item is only a performance issue, then the problem becomes a lot simpler and you can use lossy datastructures, such as bloom filters.
You would have to think about how to do the lookup, since there are two methods that the set needs, __hash__ and __eq__.
The hash is a "loose part" that you can take away, but the __eq__ is not a loose part that you can save; you have to have two strings for the comparison.
If you only need negative confirmation (this item is not part of the set), you could fill a Set collection you implemented yourself with your strings, then you "finalize" the set by removing all strings, except those with collisions (those are kept around for eq tests), and you promise not to add more objects to your Set. Now you have an exclusive test available.. you can tell if an object is not in your Set. You can't be certain if "obj in Set == True" is a false positive or not.
Edit: This is basically a bloom filter that was cleverly linked, but a bloom filter might use more than one hash per element which is really clever.
Edit2: This is my 3-minute bloom filter:
class BloomFilter (object):
    """
    Let's make a bloom filter
    http://en.wikipedia.org/wiki/Bloom_filter
    __contains__ has false positives, but never false negatives
    """
    def __init__(self, hashes=(hash, )):
        self.hashes = hashes
        self.data = set()
    def __contains__(self, obj):
        return all((h(obj) in self.data) for h in self.hashes)
    def add(self, obj):
        self.data.update(h(obj) for h in self.hashes)
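The class above stores hash values in a set, so its memory use still grows with the number of items. A more conventional variant keeps a fixed-size bit array instead. The following is a rough sketch of that variant, not the answer's implementation: BitBloomFilter, its default sizes, and the way the positions are derived from a single SHA-256 digest (double hashing) are all assumptions chosen for illustration.

```python
import hashlib


class BitBloomFilter:
    """Bloom filter backed by a fixed-size bit array (hypothetical sketch)."""

    def __init__(self, num_bits=1 << 20, num_hashes=5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)   # all bits start at 0

    def _positions(self, item):
        # Derive num_hashes bit positions from one SHA-256 digest
        # via double hashing: h1 + i * h2 (mod num_bits).
        d = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(d[:8], "big")
        h2 = int.from_bytes(d[8:16], "big") | 1      # forced odd, so never zero
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True may be a false positive; False is always correct.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


bf = BitBloomFilter()
bf.add("a long string that was processed")
```

With these defaults the filter occupies 128 KiB no matter how many strings are added; the price is that the false positive rate rises as it fills up.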
As has been hinted already, if the answers offered here (most of which break down in the face of hash collisions) are not acceptable you would need to use a lossless representation of the strings.
Python's zlib module provides built-in string compression capabilities and could be used to pre-process the strings before you put them in your set. Note however that the strings would need to be quite long (which you hint that they are) and have minimal entropy in order to save much memory space. Other compression options might provide better space savings and some Python based implementations can be found here
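A small sketch of the zlib idea, assuming the strings are long and repetitive. The pack helper and the sample string are invented for illustration. Because the compression is lossless, membership tests are exact and the original string can be recovered.

```python
import zlib


def pack(text):
    """Losslessly compress a string to bytes for storage in a set."""
    return zlib.compress(text.encode("utf-8"), 9)   # 9 = best compression


done = set()
item = "GET /index.html HTTP/1.1 " * 200   # long, highly repetitive string
done.add(pack(item))

is_done = pack(item) in done
restored = zlib.decompress(next(iter(done))).decode("utf-8")
```

Short or high-entropy strings can come out larger after compression, so it is worth measuring on real data before adopting this.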
