Fast way of slicing columns from tuples - python

I have a huge list of tuples from which I want to extract individual columns. I have tried two methods.
Assuming the list is named List and I want to extract the j-th column.
First one is
column=[item[j] for item in List]
Second one is
newList = list(zip(*List))  # list() is needed on Python 3, where zip returns an iterator
column = newList[j]
However, both methods are too slow, since the list has about 50000 entries and each tuple about 100 elements. Is there a faster way to extract the columns from the list?

this is something numpy does well
A = np.array(Lst)  # this step may take a while; ideally keep Lst as a NumPy array from the start
sliced = A[:, j]   # the slice itself is very fast (use A[:, [j]] if you want a 2-D column instead)
that said
newList = list(zip(*List))
column = newList[j]
takes less than a second for me with a 50k x 100 list of tuples ... so maybe profile your code and make sure the bottleneck is actually where you think it is ...
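To make the comparison concrete, here is a rough sketch of all three approaches side by side; the data is synthetic, just matching the 50k x 100 shape described above:

```python
import numpy as np

# Synthetic stand-in for the question's data: 50,000 tuples of 100 elements.
rows = [tuple(range(100)) for _ in range(50000)]
j = 42

# Method 1: list comprehension over the j-th element of each tuple.
col1 = [row[j] for row in rows]

# Method 2: transpose with zip; on Python 3, wrap in list() before indexing.
col2 = list(zip(*rows))[j]

# Method 3: convert once to a NumPy array, then slice columns cheaply.
A = np.array(rows)   # one-time conversion cost
col3 = A[:, j]       # 1-D slice of column j; essentially free after conversion

assert col1 == list(col2) == list(col3)
```

If you need many columns, the one-time `np.array` conversion amortizes quickly; for a single column the plain comprehension is usually fine.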


Modify list elements whose indices are defined by a list without a for loop

I want to modify list elements (e.g. putting them equal to 1) whose indices are defined by a list.
A (wrong) idea could be:
my_list = [1,2,3,11,22,4]
my_index = [1,3,4]
[my_list[i] = 1 for i in my_index]
There is always the brute forcing:
for i in my_index:
    my_list[i] = 1
Is there a more efficient way to do this? Is there a way to vectorize this problem? The list may also contain elements of different types.
There's nothing wrong with the "brute forcing", it's readable and clear.
There are ways to do this with e.g. numpy arrays that may be faster. But do you really need more speed?
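For completeness, the NumPy version uses fancy indexing to assign to all the listed positions at once. It is only worthwhile if the data is already an array (or large enough to justify conversion), and it requires a uniform element type:

```python
import numpy as np

my_list = [1, 2, 3, 11, 22, 4]
my_index = [1, 3, 4]

arr = np.array(my_list)
arr[my_index] = 1        # fancy indexing: one vectorized assignment, no Python loop
print(arr.tolist())      # [1, 1, 3, 1, 1, 4]
```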

Python Matching Multiple Keys/ Unique Pairs to a Value

What would be the fastest, most efficient way to map multiple values to one value? As a use case, say you are multiplying two numbers and want to remember whether you have multiplied those numbers before. Instead of building a giant X-by-Y matrix and filling it out, it would be nice to query a dict to see whether dict[2, 3] == 6 or dict[3, 2] == 6. This would be especially useful for more than two values.
I have seen an answer similar to what I'm asking here, but would this be O(n) time or O(1)?
print value for matching multiple key
for key in responses:
    if user_message in key:
        print(responses[key])
Thanks!
Seems like the easiest way to do this is to sort the values before putting them in the dict, then sort the x, y, ... values before looking them up. Note that dictionary keys must be hashable, so use tuples rather than lists (lists are mutable and unhashable).
the_dict = {(2, 3, 4): 24, (4, 5, 6): 120}
nums = tuple(sorted([6, 4, 5]))
if nums in the_dict:
    print(the_dict[nums])
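Putting the pieces together, here is a sketch of a memoized multiply built on sorted-tuple keys (the `product` and `cache` names are mine). Each lookup is an average-case O(1) dict access, unlike the O(n) scan over all keys in the loop quoted above:

```python
cache = {}

def product(*nums):
    key = tuple(sorted(nums))   # order-insensitive, hashable key
    if key not in cache:
        result = 1
        for n in nums:
            result *= n
        cache[key] = result     # remember the answer for any ordering of nums
    return cache[key]

print(product(2, 3))     # 6, computed and cached under (2, 3)
print(product(3, 2))     # 6, cache hit: same sorted key
print(product(4, 5, 6))  # 120
```

A `frozenset` key would also be order-insensitive, but it collapses duplicates ((2, 2) becomes {2}), so sorted tuples are the safer choice.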

Choosing N changing points in a sorted list

Imagine we have a sorted list of size P. How can we choose N indices whose values reflect the full range of the list as smoothly as possible? For example, if our list is:
List=[0,0,0,0,0,0,0,0,0,0.1,0.1,0.9,0.91,0.91,0.92,0.99,0.99,0.99]
Then how do we choose, say, 5 indices that somehow represent the full range of the list?
In this example it would be something like:
indices=[0,9,11,14,15]
The final indices list doesn't have to be exactly the one I wrote here, though.
This will give you a starting point:
[List.index(x) for x in set(List)]
Now this may have too many elements, but "somehow" is subjective and not a clear enough definition of what you need to do. As a default you can keep the first and last elements, then pick as many as you require from the middle.
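One way to make that concrete (the helper name is mine): take the first index of each distinct value, then, if there are too many, downsample while always keeping the first and the last:

```python
def spread_indices(data, n):
    # First occurrence of each distinct value; data is sorted, so sorting
    # the indices also orders them by value.
    firsts = sorted(data.index(v) for v in set(data))
    if len(firsts) <= n:
        return firsts
    # Downsample to n entries, always keeping the first and the last.
    step = (len(firsts) - 1) / (n - 1)
    return [firsts[round(i * step)] for i in range(n)]

List = [0,0,0,0,0,0,0,0,0,0.1,0.1,0.9,0.91,0.91,0.92,0.99,0.99,0.99]
print(spread_indices(List, 5))   # [0, 9, 11, 14, 15]
```

On the example list this happens to reproduce the indices from the question, but since "somehow" is subjective, treat it as one reasonable default rather than the answer.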

Append to Numpy Using a For Loop

I am working on a Python script that takes live streaming data and appends it to a NumPy array. I noticed that appending to the four arrays one by one works. For example:
openBidArray = np.append(openBidArray, bidPrice)
highBidArray = np.append(highBidArray, bidPrice)
lowBidArray = np.append(lowBidArray, bidPrice)
closeBidArray = np.append(closeBidArray, bidPrice)
However If I do the following it does not work:
arrays = ["openBidArray", "highBidArray", "lowBidArray", "closeBidArray"]
for array in arrays:
    array = np.append(array, bidPrice)
Any idea on why that is?
Do this instead:
arrays = [openBidArray, highBidArray, lowBidArray, closeBidArray]
In other words, your list should be a list of arrays, not a list of strings that coincidentally contain the names of arrays you happen to have defined.
Your next problem is that np.append() returns a copy of the array with the item appended, rather than appending in place. You store this result in array, but array will be assigned the next item from the list on the next iteration, and the modified array will be lost (except for the last one, of course, which will be in array at the end of the loop). So you will want to store each modified array back into the list. To do that, you need to know what slot it came from, which you can get using enumerate().
for i, array in enumerate(arrays):
    arrays[i] = np.append(array, bidPrice)
Now of course this doesn't update your original variables, openBidArray and so on. You could do this after the loop using unpacking:
openBidArray, highBidArray, lowBidArray, closeBidArray = arrays
But at some point it just makes more sense to store the arrays in a list (or a dictionary if you need to access them by name) to begin with and not use the separate variables.
N.B. if you used regular Python lists here instead of NumPy arrays, some of these issues would go away. append() on lists is an in-place operation, so you wouldn't have to store the modified array back into the list or unpack to the individual variables. It might be feasible to do all the appending with lists and then convert them to arrays afterward, if you really need NumPy functionality on them.
In your second example, you have strings, not np.array objects. You are trying to append a number(?) to a string.
The string "openBidArray" doesn't hold any link to an array called openBidArray.
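A sketch of the dictionary-of-lists approach suggested above (`bidPrice` here is a placeholder standing in for the live streamed value): plain lists append in place, and you convert to arrays only when you actually need NumPy functionality:

```python
import numpy as np

bidPrice = 101.25   # placeholder for the live streamed value

# One list per series, accessible by name.
series = {name: [] for name in ("openBid", "highBid", "lowBid", "closeBid")}

for name in series:
    series[name].append(bidPrice)   # list.append mutates in place; no rebinding needed

# Convert on demand when NumPy operations are actually required.
closeBidArray = np.array(series["closeBid"])
print(closeBidArray.shape)   # (1,)
```

This avoids both pitfalls at once: no strings standing in for arrays, and no lost copies from `np.append`.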

Python Spark split list into sublists divided by the sum of value inside elements

I am trying to split a list of objects in Python into sublists based on the cumulative value of one of the parameters in the objects. Let me show it with an example:
I have a list of objects like this:
[{'x': 1, 'y': 2}, {'x': 3, 'y': 2}, ..., {'x': 5, 'y': 1}]
and I want to divide this list into sub-lists where the total sum of x values inside a sublist will be the same (or roughly the same) so the result could look like this:
[[{'x': 3, 'y': 1}, {'x': 3, 'y': 1}, {'x': 4, 'y': 1}], [{'x': 2, 'y': 1}, {'x': 2, 'y': 1}, {'x': 6, 'y': 1}]]
Here the sum of the x's in each sublist is equal to 10. The objects I am working with are a bit more complicated, and my x's are floats. So I want to accumulate values from the ordered list until the sum of x's is >= 10, and then start the next sub-list.
In my case, the first list of elements is an ordered list, and the summation has to take place on the ordered list.
I have done something like this in C# already, where I iterate through all my elements and keep a running counter of the x value. I sum the x values of consecutive objects until the counter hits my threshold, then I create a new sub-list and reset the counter.
Now I want to reimplement it in python, and next use it with Spark. So I am looking for a little bit more "functional" implementation, maybe something to work nicely with map-reduce framework. I can't figure out another way than the iterative approach.
If you have any suggestions, or possible solutions, I would welcome all constructive comments.
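A sketch of the threshold-based grouping described above (the function and variable names are mine). The running sum makes the split inherently sequential, so with Spark the realistic option is to apply it per partition, e.g. via `rdd.mapPartitions(split_by_x_sum)`, accepting that groups cannot cross partition boundaries:

```python
def split_by_x_sum(items, threshold=10.0):
    groups, current, total = [], [], 0.0
    for item in items:
        current.append(item)
        total += item["x"]
        if total >= threshold:       # close the group once the running sum hits the threshold
            groups.append(current)
            current, total = [], 0.0
    if current:                      # keep any trailing partial group
        groups.append(current)
    return groups

data = [{"x": 3, "y": 1}, {"x": 3, "y": 1}, {"x": 4, "y": 1},
        {"x": 2, "y": 1}, {"x": 2, "y": 1}, {"x": 6, "y": 1}]
print([[d["x"] for d in g] for g in split_by_x_sum(data)])
# [[3, 3, 4], [2, 2, 6]]
```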
