Most efficient method of referencing large variables in user-defined Python functions?

I was wondering if there is a more efficient method of referencing large variables (such as arrays with hundreds of thousands of entries) from a function in Python than simply passing them in as arguments? I know global is an option, but it's so... unreliable, for lack of a better word, that I pretty much consider it irrelevant (unless, perhaps, somebody can explain why this isn't the case). I ask because I recently wrote a script which calls the function:
def build(unique, gene, index):
    """Concatenates entries from the arguments into a single string."""
    # Build a list from the entries in all of unique's sublists
    hold = []
    hold.append([category[index] for category in unique[1]])
    # Build a list of strings concatenated from entries in other lists/arrays
    line = ['\t'.join(gene[0:7]), '\t'.join(hold[0]), '\t'.join(gene[9:len(gene)])]
    # Concatenate the list into a single string
    line = '\t'.join(line)
    return line
From the loop:
for gene in table[1:]:
    buffer.append(build(unique, gene, table.index(gene)))
The variable unique is an array with about 500k entries and the loop runs about 60k times. I understand that this is bound to take a while (it's currently sitting at about 12 minutes for this loop alone), but I'm hoping there's a way to optimize how unique is referenced in the function, so that a massive array doesn't have to be passed in every time.
Thanks in advance!

There is nothing large being passed here. unique is a reference to the same list every time; nothing is copied on a function call.
You will need to look elsewhere for optimisations.
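A quick way to convince yourself of this is to compare object identities inside and outside a function; a minimal sketch:

def check(big_list):
    # The parameter is bound to the very object the caller passed in;
    # no copy of the list's contents is made for the call.
    return id(big_list)

data = list(range(500000))
assert check(data) == id(data)  # same object: the call copied nothing

If you are hunting for the real bottleneck, note that table.index(gene) rescans the list on every pass through the loop; iterating with enumerate(table[1:], start=1) would avoid that linear search.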

Related

Inefficient code for removing duplicates from a list in Python - interpretation?

I am writing a Python program to remove duplicates from a list. My code is the following:
some_values_list = [2, 2, 4, 7, 7, 8]
unique_values_list = []
for i in some_values_list:
    if i not in unique_values_list:
        unique_values_list.append(i)
print(unique_values_list)
This code works fine. However, an alternative solution is given and I am trying to interpret it (as I am still a beginner in Python). Specifically, I do not understand the added value or benefit of creating an empty set - how does that make the code clearer or more efficient? Isn't it enough to create an empty list as I have done in the first example?
The code for the alternative solution is the following:
a = [10, 20, 30, 20, 10, 50, 60, 40, 80, 50, 40]
dup_items = set()
uniq_items = []
for x in a:
    if x not in dup_items:
        uniq_items.append(x)
        dup_items.add(x)
print(dup_items)
This code also throws an error: TypeError: set() missing 1 required positional argument: 'items'. (This is from a website of Python exercises with an answer key, so it is supposed to be correct.)
Determining if an item is present in a set is generally faster than determining if it is present in a list of the same size. Why? Because for a set (at least, for a hash table, which is how CPython sets are implemented) we don't need to traverse the entire collection of elements to check if a particular value is present (whereas we do for a list). Rather, we usually just need to check at most one element. A more precise way to frame this is to say that containment tests for lists take "linear time" (i.e. time proportional to the size of the list), whereas containment tests in sets take "constant time" (i.e. the runtime does not depend on the size of the set).
Lookup of an element in a list takes O(N) time (you can find an element in logarithmic time, but only if the list is sorted, so that doesn't apply here). So if you use the same list both to keep the unique elements and to look up newly added ones, your whole algorithm runs in O(N²) time (N elements, O(N) average lookup). A set is a hash set in Python, so lookup in it takes O(1) on average. Thus, if you use an auxiliary set to keep track of the unique elements already found, your whole algorithm will only take O(N) time on average, one order better.
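A quick benchmark makes the gap concrete; a minimal sketch (absolute timings will vary by machine):

import timeit

n = 100000
as_list = list(range(n))
as_set = set(as_list)
probe = n - 1  # a value near the end: the worst case for the list scan

print(timeit.timeit(lambda: probe in as_list, number=100))  # linear scan
print(timeit.timeit(lambda: probe in as_set, number=100))   # hash lookup

On typical hardware the set lookup is several orders of magnitude faster here.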
In most cases sets are faster than lists. One of those cases is looking for an item with the in keyword. The reason sets are faster is that they are implemented as hash tables.
So, in short, if x not in dup_items in the second code snippet runs faster than if i not in unique_values_list.
If you want to check the time complexity of different Python data structures and operations, you can check this link.
I think your code is also inefficient in that, for each item, it searches the ever-growing output list, whereas the second snippet looks the item up in a set, which is usually smaller. That is not the case all of the time, though: if the input is all unique items, both collections end up the same size, and the win comes purely from the set's constant-time lookup.
Hope it clarifies.

3 questions about generators and iterators in Python

Everyone says you lose the benefit of generators if you put the result into a list.
But you need a list, or a sequence, to even have a generator to begin with, right? So, for example, if I need to go through the files in a directory, don't I have to make them into a list first, like with os.listdir()? If so, then how is that more efficient? (I am always working with strings and files, so I really hate that all the examples use range and integers, but I digress.)
Taking it a step further, the mere presence of the yield keyword is supposed to make a generator. So if I do:
for x in os.listdir():
    yield x
Is a list still being created? Or is os.listdir() itself now also magically a generator? Is it possible that, since os.listdir() hasn't been called yet, there really isn't a list here yet?
Finally, we are told that iterators need iter() and next() methods. But doesn’t that also mean they need an index? If not, what is next() operating on? How does it know what is next without an index? Before 3.6, dict keys had no order, so how did that iteration work?
No.
See, there's no list here:
def one():
    while True:
        yield 1
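You can pull values from it on demand, and no backing list ever exists; a quick sketch consuming the one() generator above with itertools.islice:

from itertools import islice

gen = one()
print(next(gen))             # 1, produced on request
print(list(islice(gen, 5)))  # [1, 1, 1, 1, 1]; still no list behind the scenes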
An index and next() are two independent tools for performing an iteration. Again, if you have an object whose iterator's next() always returns 1, you don't need any index.
In deeper detail...
See, technically, you can always associate a list and an index with any generator or iterator: simply write down all of its returned values and you get an at most countable sequence of values a₀, a₁, ... But those are merely a mathematical formalism, not necessarily having anything in common with how a real generator works. For instance, take the generator above that always yields one. You can count how many ones you have gotten from it so far and call that an index. You can write down all those ones, comma-separated, and call that a list. Do those two objects correctly describe the generator's output so far? Apparently so. Are they in the least bit important to the generator itself? Not really.
Of course, a real generator will probably have some state. You can call that state an index, provided you don't insist that an index must be a non-negative integer scalar; if the generator works deterministically, you can write down all of its states, number them, and call the current state's number an index (yes, it is approximately that). Generators will always have some source for their states and returned values. So indices and lists can be regarded as abstractions that describe an object's behaviour, but they are not necessarily the concrete implementation details that are actually used.
Consider an unbuffered file reader. It retrieves a single byte from the disk and immediately yields it. There is no real list in memory, only the file contents on the disk (and there may not even be a file, if our reader is connected to a network socket instead of a real disk drive, with the Oracle of Delphi at the other end of the connection). You can call the file position an index, until you start reading from stdin, which is only forward-traversable, so indexing it makes no real physical sense; the same goes for network connections over unreliable protocols, by the way.
Something like this.
1) This is wrong; a list is just the easiest example for explaining a generator. Think of the eight queens problem: if you yield each solution as soon as the program finds it, there is no result list to be seen anywhere. Note that the standard library often offers iterator counterparts to list-based tools (islice() vs. slice()), and an easy example not representable by a list at all is itertools.cycle().
In consequence, 2) and 3) are also wrong.
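To make the cycle() point concrete, a minimal sketch: the iterator below is infinite, so no list could ever hold its output:

from itertools import cycle, islice

colors = cycle(['red', 'green', 'blue'])  # endless repetition, no backing list
print(list(islice(colors, 7)))
# ['red', 'green', 'blue', 'red', 'green', 'blue', 'red']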

how to speed up expanding list from map iterator

I have some text data in a pandas column; each cell in the column holds one document, and each document is several sentences long.
I want to split each document into sentences and then, for each sentence, get a list of words. So if a document is 5 sentences long, I will have a list of lists of words with length 5.
I used a mapper function to do some operations on that and got a list of words for each sentence of a text. Here is the mapper code:
import spacy

def text_to_words(x):
    """This function converts the sentences in a text to lists of words."""
    nlp = spacy.load('en')
    txt_to_words = [str(doc).replace(".", "").split(" ") for doc in nlp(x).sents]
    return txt_to_words
Then I did this:
%%time
txt_to_words = map(text_to_words, pandas_df.log_text_cleaned)
It finished in 70 microseconds and I got a map iterator.
Now I want to add each document's list of lists of words as a value in a new column of the same pandas data frame.
I can simply do this:
txt_to_words = [*map(text_to_words, pandas_df.log_text_cleaned)]
Which will expand the map iterator and store it in txt_to_words as list of list of words.
But this process is very slow.
I even tried looping over the map object :
txt_to_words=map(text_to_words,pandas_df.log_text_cleaned)
txt_to_words_list=[]
for sent in txt_to_words:
txt_to_words_list.append(sent)
But this is similarly slow.
Extracting the output from a map object is very slow, and I have just 67K documents in that pandas data frame column.
Is there a way this can be sped up?
Thanks
The direct answer to your question is that the fastest way to convert an iterator to a list is probably by calling list on it, although that may depend on the size of your lists.
However, this is not going to matter, except to an unnoticeable, barely-measurable degree.
The difference between list(m), [*m], or even an explicit for statement is a matter of microseconds at most, but your code is taking seconds. In fact, you could even eliminate almost all the work done by list by using collections.deque(m, maxlen=0) (which just throws away all of the values without allocating anything or storing them), and you still won't see a difference.
Your real problem is that the work done for each element is slow.
Calling map doesn't actually do that work. All it does is construct a lazy iterator that sets up the work to be done later. When is later? When you convert the iterator to a list (or consume it in some other way).
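You can see this laziness directly; a minimal sketch:

def noisy(x):
    print("processing", x)
    return x * 2

m = map(noisy, [1, 2, 3])  # nothing is printed yet: map only built a lazy iterator
result = list(m)           # now "processing" prints three times, as the work actually runs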
So, it's that text_to_words function that you need to speed up.
And there's at least one obvious candidate for how to do that:
def text_to_words(x):
    """This function converts the sentences in a text to lists of words."""
    nlp = spacy.load('en')
    txt_to_words = [str(doc).replace(".", "").split(" ") for doc in nlp(x).sents]
    return txt_to_words
You're loading in an entire English tokenizer/dictionary/etc. for each sentence? Sure, you'll get some benefit from caching after the first time, but I'll bet it's still way too slow to do for every sentence.
If you were trying to speed things up by making it a local variable rather than a global (which probably won't matter, but it might), that's not the way to do it; this is:
import spacy

nlp = spacy.load('en')

def text_to_words(x, *, _nlp=nlp):
    """This function converts the sentences in a text to lists of words."""
    txt_to_words = [str(doc).replace(".", "").split(" ") for doc in _nlp(x).sents]
    return txt_to_words
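With the model loaded only once, the original map call should get dramatically faster; hypothetical usage (the column name here is made up):

txt_to_words = [*map(text_to_words, pandas_df.log_text_cleaned)]
pandas_df['words'] = txt_to_words  # 'words' is a hypothetical column name

If that is still too slow, spaCy also provides nlp.pipe(texts), which processes documents in batches and is usually faster than calling the model on each string individually.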

What is a "Physically Stored Sequence" in Python?

I am currently reading Learning Python, 5th Edition - by Mark Lutz and have come across the phrase "Physically Stored Sequence".
From what I've learnt so far, a sequence is an object that contains items that can be indexed in sequential order from left to right e.g. Strings, Tuples and Lists.
So in regards to a "Physically Stored Sequence", would that be a sequence that is referenced by a variable for use later on in a program? Or am I not getting it?
Thank you in advance for your answers.
A Physically Stored Sequence is best explained by contrast. It is one type of "iterable" with the main example of the other type being a "generator."
A generator is an iterable, meaning you can iterate over it as in a "for" loop, but it does not actually store anything--it merely spits out values when requested. Examples of this would be a pseudo-random number generator, the whole itertools package, or any function you write yourself using yield. Those sorts of things can be the subject of a "for" loop but do not actually "contain" any data.
A physically stored sequence then is an iterable which does contain its data. Examples include most data structures in Python, like lists. It doesn't matter in the Python parlance if the items in the sequence have any particular reference count or anything like that (e.g. the None object exists only once in Python, so [None, None] does not exactly "store" it twice).
A key feature of physically stored sequences is that you can usually iterate over them multiple times, and sometimes get items other than the "first" one (the one any iterable gives you when you call next() on it).
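A quick way to see that contrast; a minimal sketch:

squares_list = [x * x for x in range(5)]  # physically stored sequence
squares_gen = (x * x for x in range(5))   # generator: stores nothing

print(sum(squares_list), sum(squares_list))  # 30 30: can be iterated again
print(sum(squares_gen), sum(squares_gen))    # 30 0: exhausted after one pass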
All that said, this phrase is not very common--certainly not something you'd expect to see or use as a workaday Python programmer.

Difference between two "contains" operations for python lists

I'm fairly new to python and have found that I need to query a list about whether it contains a certain item.
The majority of the postings I have seen on various websites (including this similar stackoverflow question) have all suggested something along the lines of
for i in list:
    if i == thingIAmLookingFor:
        return True
However, I have also found from one lone forum that
if thingIAmLookingFor in list:
    # do work
works.
I am wondering if the if thing in list method is shorthand for the for i in list method, or if it is implemented differently.
I would also like to know which, if either, is preferred.
In your simple example it is of course better to use in.
However... in the question you link to, in doesn't work (at least not directly) because the OP does not want to find an object that is equal to something, but an object whose attribute n is equal to something.
One answer does mention using in on a list comprehension, though I'm not sure why a generator expression wasn't used instead:
if 5 in (data.n for data in myList):
    print("Found it")
But this is hardly much of an improvement over the other approaches, such as this one using any:
if any(data.n == 5 for data in myList):
    print("Found it")
the "if x in thing:" format is strongly preferred, not just because it takes less code, but it also works on other data types and is (to me) easier to read.
I'm not sure how it's implemented, but I'd expect it to be quite a lot more efficient on datatypes that are stored in a more searchable form. eg. sets or dictionary keys.
if thing in somelist is the preferred and fastest way.
Under the hood, that use of the in operator translates to somelist.__contains__(thing), whose implementation is equivalent to any((x is thing or x == thing) for x in somelist).
Note that the condition tests identity first, then equality.
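That identity-first behaviour is observable with NaN, a value that compares unequal even to itself; a minimal sketch (the any() line deliberately drops the identity test to show the difference):

nan = float("nan")
somelist = [1, 2, nan]

print(nan in somelist)                  # True: the identity test (x is thing) matches
print(any(x == nan for x in somelist))  # False: equality alone never matches NaN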
for i in list:
    if i == thingIAmLookingFor:
        return True
The above is a terrible way to test whether an item exists in a collection. It returns True from the function, so if you need the test as part of some code you'd need to move this into a separate utility function, or add thingWasFound = False before the loop and set it to True in the if statement (and then break), either of which is several lines of boilerplate for what could be a simple expression.
Plus, if you just use thingIAmLookingFor in list, this might execute more efficiently by doing fewer Python level operations (it'll need to do the same operations, but maybe in C, as list is a builtin type). But even more importantly, if list is actually bound to some other collection like a set or a dictionary thingIAmLookingFor in list will use the hash lookup mechanism such types support and be much more efficient, while using a for loop will force Python to go through every item in turn.
Obligatory post-script: list is a terrible name for a variable that contains a list as it shadows the list builtin, which can confuse you or anyone who reads your code. You're much better off naming it something that tells you something about what it means.
