I have just started using Python and I can't figure out what is meant by "parallel lists".
Any info would be great. I think it is just using two lists to store info.
"Parallel lists" is a variation on the term "parallel array". The idea is that instead of having a single array/list/collection of records (objects with attributes, in Python terminology) you have a separate array/list/collection for each field of a conceptual record.
For example, you could have Person records with a name, age, and occupation:
people = [
    Person(name='Bob', age=38, occupation=PROFESSIONAL_WEASEL_TRAINER),
    Person(name='Douglas', age=42, occupation=WRITER),
    # etc.
]
or you could have "parallel lists" for each attribute:
names = ['Bob', 'Douglas', ...]
ages = [38, 42, ...]
occupations = [PROFESSIONAL_WEASEL_TRAINER, WRITER, ...]
Both of these approaches store the same information, but depending on what you're doing one may be more efficient to deal with than the other. Using a parallel collection can also be handy if you want to sort of "annotate" a given collection without actually modifying the original.
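For instance, here is a minimal sketch (with plain strings standing in for the occupation constants above) of how the same index ties the parallel lists together, and how zip can stitch them back into record-like tuples:
names = ['Bob', 'Douglas']
ages = [38, 42]
occupations = ['weasel trainer', 'writer']

# The same index refers to the same conceptual record in every list.
print('%s is %d' % (names[1], ages[1]))  # Douglas is 42

# zip stitches the parallel lists back into record-like tuples.
for name, age, occupation in zip(names, ages, occupations):
    print('%s, %d, %s' % (name, age, occupation))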
(Parallel arrays were also really common in languages that didn't support proper records but which did support arrays, like many versions of BASIC for 8-bit machines.)
The term "parallel lists" isn't a standard one. Maybe you're talking about iterating through two lists in parallel, which means iterating over both at the same time. For more, read "how can I iterate through two lists in parallel in Python?".
If you're trying to iterate over corresponding items in two or more lists, see itertools.izip (or just use the zip builtin if you're using Python 3).
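A quick sketch, assuming two lists of equal length:
names = ['Bob', 'Douglas']
ages = [38, 42]

# zip pairs up corresponding items; on Python 2, itertools.izip
# does the same lazily without building the intermediate list.
for name, age in zip(names, ages):
    print('%s is %d' % (name, age))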
The only time I've seen this term in use was when I was using Haskell. See here:
http://www.haskell.org/ghc/docs/5.00/set/parallel-list-comprehensions.html
Essentially the Python equivalent is zipping the lists together:
[(x, y) for x, y in zip(range(1, 3), range(1, 3))]
(Note that the nested form [(x, y) for x in range(1, 3) for y in range(1, 3)] is not equivalent: it produces the Cartesian product of the two ranges, not pairwise tuples.)
However you can just use zip/izip for this.
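To make the distinction concrete, a short sketch comparing the two forms:
xs = [1, 2]
ys = ['a', 'b']

# Parallel (zipped): pairs corresponding elements.
print([(x, y) for x, y in zip(xs, ys)])
# [(1, 'a'), (2, 'b')]

# Nested: every combination (the Cartesian product).
print([(x, y) for x in xs for y in ys])
# [(1, 'a'), (1, 'b'), (2, 'a'), (2, 'b')]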
It sometimes refers to two lists whose elements are in correspondence. See this SO question.
I am writing a program to find the most similar words for a list of input words using Gensim for word2vec.
Example: input_list = ["happy", "beautiful"]
Further, I use a for loop to iterate over the list and store the output in a list using the .append() method.
The final list is a list of lists containing tuples. See below:
results = [[('glad', 0.7408891320228577),
            ('pleased', 0.6632170081138611)],
           [('gorgeous', 0.8353002667427063),
            ('lovely', 0.8106935024261475)]]
My question is: how do I separate the list of lists into independent lists? I followed answers from 1 and 2, which suggest unpacking like a, b = results.
But this is only possible when you know the number of input elements in advance (here, 2).
Expected Output (based on above):
list_a = [('glad', 0.7408891320228577), ('pleased', 0.6632170081138611)]
list_b = [('gorgeous', 0.8353002667427063), ('lovely', 0.8106935024261475)]
But, if the number of input elements is always variable, say 4 or 5, then how do we unpack and get a reference to the independent lists at run-time?
Or what is a better data structure to store the above results so that unpacking or further processing is friendlier?
Kindly help.
If you have a variable number of query-words (sometimes 2, sometimes 5, sometimes any other number N), then you almost certainly do not want to break those out into totally-separate variable names (like list_a, list_b, etc.).
Why not? Well, your next steps will then likely be to do something to each of the N items.
And to do that, you'll then want them in some sort of indexed-list you can iterate over.
What if instead they're in some bunch of local variables (list_a, list_b, list_c, list_d) like you've requested? Then in the case where there are only 3 result sets, some of those variables, like list_d, either won't exist (be undefined) or will hold some different signal value (like, say, None).
For most tasks, that will then be harder to work with, requiring awkward branches/tests for every possible count of results.
Instead, your existing results, which is a list, where you can access each by numeric index – results[0], results[1] – either alone, or in a loop, is a much more typically-useful structure when the count of things you're dealing with will vary.
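For example, a minimal sketch (with abridged scores from the question's results) of handling any number of result sets by index rather than by separate names:
results = [[('glad', 0.74), ('pleased', 0.66)],
           [('gorgeous', 0.84), ('lovely', 0.81)]]

# Works for any number of result sets; no per-set variable names needed.
for i, result_set in enumerate(results):
    print('results[%d]: %s' % (i, result_set))

first = results[0]  # still addressable individually when needed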
If you think you have a valid reason for your expected end-state, please describe the reason, and especially the next things you then want to do, in more detail, via an expansion to the question. And consider those next steps for several different scenarios: just 1 set of results, 2 sets of results, 5 sets of results, 100 sets of results. (In that last case, what would you even name the variables, beyond list_z?)
(Separately, this is not really a question about Gensim or word2vec, but about core Python language features and variable/data-structure handling. So I've removed those tags, and added destructuring, a term for the sort of multiple-variable assignment that almost does what you need but isn't quite right, and will tune the title a bit.)
Basically, I have some variables and I want to quickly iterate through them.
I see three possibilities:
Using a list
Using a tuple
Using an implicit tuple
Respectively for example:
for regex in [regex_mail, regex_name]:
    ...

for regex in (regex_mail, regex_name):
    ...

for regex in regex_mail, regex_name:
    ...
Is there any reference indicating the syntax I should use?
I looked at PEP 8, but nothing is said about it.
I know this question may look "primarily opinion based", but I am looking for concrete arguments that might allow me to choose the most suitable style (and PEP 20 states "There should be one-- and preferably only one --obvious way to do it").
First, any performance difference between the 3 syntaxes is probably negligible.
Then, this answer does a good job at describing the difference between lists and tuples:
Tuples are heterogeneous data structures (i.e., their entries have
different meanings), while lists are homogeneous sequences. Tuples
have structure, lists have order.
The usual example is a collection of GPS coordinates. Use a tuple to group X, Y, and Z together; use a list for the sequence of coordinates:
[(48.77, 9.18, 400), (48.77, 9.185, 405), (48.77, 9.19, 410)]
According to this philosophy, a list should be used in your case.
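To illustrate, a small sketch of that convention using the coordinates above:
# A list (homogeneous sequence) of tuples (heterogeneous structure).
waypoints = [(48.77, 9.18, 400), (48.77, 9.185, 405), (48.77, 9.19, 410)]

for lat, lon, alt in waypoints:
    print('%.2f, %.3f at %dm' % (lat, lon, alt))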
I have a .NET structure that has arrays nested within arrays. I want to create a list of members of items from a specific array inside a specific array, using a list comprehension in IronPython if possible.
Here is what I am doing now:
tag_results = [item_result for item_result in results.ItemResults if item_result.ItemId == tag_id][0]
tag_vqts = [vqt for vqt in tag_results.VQTs]
tag_timestamps = [vqt.TimeStamp for vqt in tag_vqts]
So, get the single item result from the results array which matches my condition, then get the vqts arrays from those item results, THEN get all the timestamp members for each VQT in the vqts array.
Is wanting to do this in a single statement overkill? Later on, the timestamps are used in this manner:
vqts_to_write = [vqt for vqt in resampled_vqts if not vqt.TimeStamp in tag_timestamps]
I am not sure if a generator would be appropriate, since I am not really looping through them, I just want a list of all the timestamps for all the item results for this item/tag so that I can test membership in the list.
I have to do this multiple times for different contexts in my script, so I was just wondering if I am doing this in an efficient and pythonic manner. I am refactoring this into a method, which got me thinking about making it easier.
FYI, this is IronPython 2.6, embedded in a fixed environment that does not allow the use of numpy, pandas, etc. It is safe to assume I need a python 2.6 only solution.
My main question is:
Would collapsing this into a single line, if possible, obfuscate the code?
If collapsing is appropriate, would a method be overkill?
Two! My two main questions are:
Would collapsing this into a single line, if possible, obfuscate the code?
If collapsing is appropriate, would a method be overkill?
Is a generator appropriate for testing membership in a list?
Three! My three questions are... Amongst my questions are such diverse queries as...I'll come in again...
(it IS python...)
tag_results = [...][0] builds a whole new list just to get one item. This is what next() on a generator expression is for:
next(item_result for item_result in results.ItemResults if item_result.ItemId == tag_id)
which only iterates just enough to get a first item.
You can inline that, but I'd keep that as a separate expression for readability.
The remainder is easily put into one expression:
tag_results = next(item_result for item_result in results.ItemResults
                   if item_result.ItemId == tag_id)
tag_timestamps = [vqt.TimeStamp for vqt in tag_results.VQTs]
I'd make that a set if you only need to do membership testing:
tag_timestamps = set(vqt.TimeStamp for vqt in tag_results.VQTs)
Sets allow for constant time membership tests; testing against a list takes linear time as the whole list could end up being scanned for each such test.
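Putting it together, a sketch of the whole flow, reusing the question's own names (results.ItemResults, VQTs, tag_id, and resampled_vqts come from the questioner's .NET object model):
# results, tag_id and resampled_vqts come from the question's .NET object model
tag_results = next(item_result for item_result in results.ItemResults
                   if item_result.ItemId == tag_id)

# A set makes each membership test O(1) instead of O(n).
tag_timestamps = set(vqt.TimeStamp for vqt in tag_results.VQTs)

vqts_to_write = [vqt for vqt in resampled_vqts
                 if vqt.TimeStamp not in tag_timestamps]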
I'm working on a script for a piece of software, and it doesn't really give me direct access to the data I need. Instead, I need to ask for each piece of information I need, and build a list of the data I'm getting. For various reasons, I need the list to be sorted. It's very easy to just build the list once, and then sort it, followed by doing stuff with it. However, I assume it would be faster to run through everything once, rather than build the list and then sort it.
So, at the moment I've basically got this:
my_list = []
for item in "query for stuff":
    my_list.append("query for %s data" % item)

my_list.sort()
do_stuff(my_list)
The "query for stuff" bit is the query interface with the software, which will give me an iterable. my_list needs to contain a list of data from the contents of said iterable. By doing it like this, I'm querying for the first list, then looping over it to extract the data and put it into my_list. Then I'm sorting it. Lastly, I'm doing stuff to it with the do_stuff() method, which will loop over it and do stuff to each item.
The problem is that I can't do_stuff() to it before it's sorted, as the list order is important for various reasons. I don't think I can get away from having to loop over lists twice — once to build the list and once to do stuff to each item in it, as we won't know in advance if a recently added item at position N will stay at position N after we've added the next item — but it seems cleaner to insert each item in a sorted fashion, rather than just appending them at the end. Kind of like this:
for item in "query for stuff":
my_list.append_sorted(item)
Is it worth bothering trying to do it like this, or should I just stick to building the list, and then sorting it?
Thanks!
The short answer is: it's not worth it.
Have a look at insertion sort. The worst-case running time is O(n^2) (average case is also quadratic). On the other hand, Python's sort (also known as Timsort) will take O(n log n) in the worst case.
Yes, it does "seem" cleaner to keep the list sorted as you're inserting, but that's a fallacy.
There is no real benefit to it. The only time you'd consider using insertion sort is when you need to show the sorted list after every insertion.
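If you want to convince yourself, here is a rough, self-contained benchmark sketch (random data; exact timings will vary):
import bisect
import random
import timeit

data = [random.random() for _ in range(10000)]

def build_then_sort():
    result = list(data)
    result.sort()  # Timsort: O(n log n) worst case
    return result

def insert_sorted():
    result = []
    for x in data:
        # bisect finds the spot in O(log n), but the insert itself
        # shifts up to n items, so the whole loop is O(n^2)
        bisect.insort(result, x)
    return result

print(timeit.timeit(build_then_sort, number=10))
print(timeit.timeit(insert_sorted, number=10))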
The two approaches are asymptotically equivalent.
Sorting is O(n lg n) (Python uses Timsort by default, except for very small arrays), and inserting in a sorted list is O(lg n) (using binary search), which you would have to do n times.
In practice, one method or the other may be slightly faster, depending on how much of your data is already sorted.
EDIT: I assumed that inserting in the middle of a sorted list after you've found the insertion point would be constant time (i.e. the list behaved like a linked list, which is the data structure you would use for such an algorithm). This probably isn't the case with Python lists, as pointed out by Sven. This would make the "keep the list sorted" approach O(n^2), i.e. insertion sort.
I say "probably" because some list implementations switch from array to linked list as the list grows, the most notable example being CFArray/NSArray in CoreFoundation/Cocoa. This may or may not be the case with Python.
Take a look at the bisect module. It gives you various tools for maintaining list order. In your case, you probably want to use bisect.insort:
import bisect

my_list = []
for item in query_for_stuff():
    bisect.insort(my_list, "query for %s data" % item)
I have a large Python dictionary (65535 key:value pairs), where the keys come from range(0, 65536) and the values are integers.
The solution I found to sorting this data structure is posted here:
Sort a Python dictionary by value
That solution works, but is not necessarily very fast.
To further complicate the problem, there is a potential for me to have many (thousands) of these dictionaries that I must combine prior to sorting. I am currently combining these dictionaries by iterating over the pairs in one dictionary, doing a key lookup in the other dictionary, and adding/updating the entry as appropriate.
This makes my question twofold:
1) Is a dictionary the right data structure for this problem? Would a custom tree or something else make more sense?
2) If a dictionary is the smart, reasonable choice, what is the ideal way to combine multiple of these dictionaries and then sort the result?
One solution may be for me to redesign my program's flow in order to decrease the number of dictionaries being maintained to one, though this is more of a last resort.
Thanks
A dictionary populated with 65535 entries with keys from range(0, 65536) sounds suspiciously like an array. If you need sorted arrays, why are you using dictionaries?
Normally, in Python, you would use a list for this type of data. In your case, since the values are integers, you might also want to consider the array module. You should also have a look at the heapq module: if your data can be represented in this way, there is a built-in merge function you could use.
In any case, if you need to merge data structures and produce a sorted data structure as a result, it is best to use a merge algorithm, and one possibility for that is mergesort.
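For example, a minimal sketch using heapq.merge (dict_a and dict_b are hypothetical stand-ins for your dictionaries; duplicate keys have their values summed, on the assumption that matches the "adding/updating" you described):
import heapq

dict_a = {0: 10, 2: 30, 1: 20}
dict_b = {1: 5, 3: 40}

# heapq.merge lazily interleaves already-sorted streams.
merged = heapq.merge(sorted(dict_a.items()), sorted(dict_b.items()))

combined = {}
for key, value in merged:
    combined[key] = combined.get(key, 0) + value  # sum values on duplicate keys

print(sorted(combined.items()))
# [(0, 10), (1, 25), (2, 30), (3, 40)]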
There's not enough information here to say which data structure you should use, because we don't know what else you're doing with it.
If you need to be able to quickly insert records into the data structure later one at a time, then you do need a tree-like data structure, which unfortunately doesn't have a standard implementation (or even a standard interface, for some operations) in Python.
If you only need to be able to do what you said--to sort existing data--then you can use lists. The sorting is quick, especially if parts of the data are already sorted, and you can use binary searching for fast lookups. However, inserting elements will be O(n) rather than the O(log n) you'll get with a tree.
Here's a simple example, converting the dicts to a list or tuples, sorting the combined results and using the bisect module to search for items.
Note that you can have duplicate keys, showing up in more than one dict. This is handled easily: they'll be sorted together naturally, and the bisection will give you a [start, end) range containing all of those keys.
If you want to add blocks of data later, append it to the end and re-sort the list; Python's sorting is good at that and it'll probably be much better than O(n log n).
This code assumes your keys are integers, as you said.
dataA = { 1: 'data1', 3: 'data3', 5: 'data5', 2: 'data2' }
dataB = { 2: 'more data2', 4: 'data4', 6: 'data6' }

combined_list = dataA.items() + dataB.items()
combined_list.sort()
print combined_list

import bisect

def get_range(data, value):
    lower_bound = bisect.bisect_left(data, (value, ))
    upper_bound = bisect.bisect_left(data, (value + 1, ))
    return lower_bound, upper_bound

lower_bound, upper_bound = get_range(combined_list, 2)
print lower_bound, upper_bound
print combined_list[lower_bound:upper_bound]
With that quantity of data, I would bite the bullet and use the built-in sqlite3 module. Yes, you give up some Python flexibility and have to use SQL, but right now it's sorting 65k values; next it will be finding values that meet certain criteria. So instead of reinventing relational databases, just go the SQL route now.
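A rough sketch of that approach; dict_a and dict_b are hypothetical stand-ins for your dictionaries, and duplicate keys are summed on the assumption that matches your combining step:
import sqlite3

dict_a = {0: 10, 1: 20, 2: 30}
dict_b = {1: 5, 3: 40}

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE data (key INTEGER, value INTEGER)')
conn.executemany('INSERT INTO data VALUES (?, ?)', dict_a.items())
conn.executemany('INSERT INTO data VALUES (?, ?)', dict_b.items())

# Combining and sorting happen in one query.
for key, value in conn.execute('SELECT key, SUM(value) FROM data GROUP BY key ORDER BY key'):
    print('%d: %d' % (key, value))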