Indexing failure/odd behaviour with array - python

I have some code that is intended to convert a 3-dimensional list to an array. Technically it works, in that I get a 3-dimensional array, but indexing only works when I don't iterate across one of the dimensions, and fails when I do.
Indexing works here:
listTempAllDays = []
for j in listGPSDays:
    listTempDay = []
    for i in listGPSDays[0]:
        arrayDay = np.array(i)
        listTempDay.append(arrayDay)
    arrayTemp = np.array(listTempDay)
    listTempAllDays.append(arrayTemp)
arrayGPSDays = np.array(listTempAllDays)

print(arrayGPSDays[0,0,0])
It doesn't work here:
listTempAllDays = []
for j in listGPSDays:
    listTempDay = []
    for i in j:
        arrayDay = np.array(i)
        listTempDay.append(arrayDay)
    arrayTemp = np.array(listTempDay)
    listTempAllDays.append(arrayTemp)
arrayGPSDays = np.array(listTempAllDays)

print(arrayGPSDays[0,0,0])
The difference between the two pieces of code is in the inner for loop. The first piece of code also works for all the other elements of listGPSDays (e.g. for i in listGPSDays[1]:, etc.).
Removing the final print call allows the second piece of code to run, as does changing the final line to print(arrayGPSDays[0][0,0]).
In both cases checking the type at all levels returns <class 'numpy.ndarray'>.
I would like this array indexing to work, if possible - what am I missing?
The following is provided as example data:
Anonymised results from print(arrayGPSDays[0:2,0:2,0:2]), generated using the first piece of code (so that the indexing works! - but also resulting in arrayGPSDays[0] being the same as arrayGPSDays[1]):
[[['1' '2']
  ['3' '4']]

 [['1' '2']
  ['3' '4']]]

numpy's array constructor can handle arbitrarily dimensioned iterables. The only stipulation is that they can't be jagged (i.e. each "row" in each dimension must have the same length).
Here's an example:
In [1]: list_3d = [[['a', 'b', 'c'], ['d', 'e', 'f']], [['g', 'h', 'i'], ['j', 'k', 'l']]]
In [2]: import numpy as np
In [3]: np.array(list_3d)
Out[3]:
array([[['a', 'b', 'c'],
        ['d', 'e', 'f']],

       [['g', 'h', 'i'],
        ['j', 'k', 'l']]], dtype='<U1')
In [4]: array_3d = np.array(list_3d)
In [5]: array_3d[0,0,0]
Out[5]: 'a'
In [6]: array_3d.shape
Out[6]: (2, 2, 3)
If the array is jagged, numpy will "squash" it down to the dimension where the jaggedness happens. Since that explanation is clear as mud, an example might help:
In [20]: jagged_3d = [ [['a', 'b'], ['c', 'd']], [['e', 'f'], ['g', 'h'], ['i', 'j']] ]
In [21]: jagged_arr = np.array(jagged_3d)
In [22]: jagged_arr.shape
Out[22]: (2,)
In [23]: jagged_arr
Out[23]:
array([list([['a', 'b'], ['c', 'd']]),
       list([['e', 'f'], ['g', 'h'], ['i', 'j']])], dtype=object)
The reason the constructor isn't working out of the box is that you have a jagged array. numpy simply does not support jagged arrays, because each numpy array has a well-defined shape giving the length of each dimension. If the items in a given dimension have different lengths, that abstraction falls apart, and numpy doesn't allow it.
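If you want to detect jaggedness up front, a minimal recursive check might help (is_rectangular is a hypothetical helper, not part of numpy):

def is_rectangular(nested):
    # A non-list leaf is trivially fine.
    if not isinstance(nested, list):
        return True
    # Mixing lists and scalars at one level, or sibling lists of
    # different lengths, makes the structure jagged.
    if any(isinstance(item, list) for item in nested):
        if not all(isinstance(item, list) for item in nested):
            return False
        if len({len(item) for item in nested}) > 1:
            return False
    return all(is_rectangular(item) for item in nested)

print(is_rectangular(list_3d))     # True
print(is_rectangular(jagged_3d))   # False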
HTH.

So Isaac, it seems your code has some misinterpretations of Python syntax.
In your for statement, j is an ITEM inside the list listGPSDays (I assume it is a list), not the ITEM's INDEX. You don't need to "get" the range of the list; Python iterates over the items for you. Try:
for j in listGPSDays:
instead of
for j in range(len(listGPSDays)):
Also, since j is already an item (not an index), try changing this line of code from:
for i in listGPSDays[j]:
to:
for i in j:
I think it will solve your problem, hope it works!
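Putting that together, a minimal sketch of the corrected loops (the listGPSDays values here are made up, and the input is assumed to be rectangular rather than jagged):

import numpy as np

listGPSDays = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]  # invented sample data

listTempAllDays = []
for j in listGPSDays:                          # j is one day (a list of rows)
    listTempDay = [np.array(i) for i in j]     # i is one row within that day
    listTempAllDays.append(np.array(listTempDay))

arrayGPSDays = np.array(listTempAllDays)
print(arrayGPSDays[0, 0, 0])                   # prints 1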

Related

Efficiency question: how to compare two huge nested lists and make changes based on criteria

I want to compare two huge, identical nested lists by iterating over both of them. I'm looking for nested lists where list_a[0] is equal to list_b[1]; in that case I want to merge those lists (the order is important). The non-matching lists I also want in the output.
rows_a = [['a', 'b', 'z'], ['b', 'e', 'f'], ['g', 'h', 'i']]
rows_b = [['a', 'b', 'z'], ['b', 'e', 'f'], ['g', 'h', 'i']]
data = []
for list_a in rows_a:
    for list_b in rows_b:
        if list_a[0] == list_b[1]:
            list_b.extend(list_a)
            data.append(list_b)
        else:
            data.append(list_b)

#print(data): [['a', 'b', 'z', 'b', 'e', 'f'], ['b', 'e', 'f'], ['g', 'h', 'i'], ['a', 'b', 'z', 'b', 'e', 'f'], ['b', 'e', 'f'], ['g', 'h', 'i'], ['a', 'b', 'z', 'b', 'e', 'f'], ['b', 'e', 'f'], ['g', 'h', 'i']]
Above is the output that I do NOT want, because it is way too much data. All this unnecessary data is caused by the double loop over both rows. A solution would be to slice an element off rows_b on every iteration of the for loop over rows_a; this would avoid many duplicate comparisons. Question: how do I skip the first element of a list every time it has looped from start to end?
To show the desired outcome, I correct the output by deleting duplicates below:
res = []
for i in data:
    if tuple(i) not in res:
        res.append(tuple(i))
print(res)
#Output: [('a', 'b', 'z', 'b', 'e', 'f'), ('b', 'e', 'f'), ('g', 'h', 'i')]
This is the output I want! But faster...And preferably without removing duplicates.
I managed to get what I want when I work with a small data set. However, I am using this for a very large data set and it gives me a 'MemoryError'. Even if it didn't give me the error, I realise it is a very inefficient script and it takes a lot of time to run.
Any help would be greatly appreciated.
tuple(i) not in res is not efficient, since it iterates over the whole list again and again, each pass taking linear time, resulting in quadratic execution time (O(n²)). You can speed this up using a set:
list({tuple(e) for e in data})
This does not preserve the order. If you want that, you can use a dict instead (this requires a fairly recent version of Python, since dicts preserve insertion order only from 3.7 on):
list({tuple(e): None for e in data}.keys())
This should be significantly faster. An alternative solution is to convert the elements to tuples, sort them, and compare adjacent pairs of values to remove duplicates. Note you can also merge two sets or two dicts with the update method.
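For concreteness, a minimal sketch of both variants, using data shaped like the example above:

data = [['a', 'b', 'z', 'b', 'e', 'f'], ['b', 'e', 'f'],
        ['g', 'h', 'i'], ['a', 'b', 'z', 'b', 'e', 'f']]

# Set-based: fastest, but the order of the result is arbitrary.
unique_unordered = list({tuple(e) for e in data})

# Dict-based: dict keys keep insertion order on Python 3.7+.
unique_ordered = list({tuple(e): None for e in data}.keys())
print(unique_ordered)
# [('a', 'b', 'z', 'b', 'e', 'f'), ('b', 'e', 'f'), ('g', 'h', 'i')]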
As for memory, there is not much to do. The problem is CPython itself, which is clearly not designed for processing large data with such data structures (only native data structures like Numpy arrays are efficient). Each character is encoded as a Python object taking 24-32 bytes. Lists contain references to objects, taking 8 bytes each on a 64-bit architecture. This means about 40 bytes per character, while 1 byte is actually needed (and is what a native C/C++ program can achieve in practice). That said, CPython caches 1-character strings, so in this specific case it uses "only" 8 bytes per character (which is still 8 times more than required). If you use lists of characters in your real-world application, please consider using strings instead. Otherwise, consider using another language.
I solved this by using a LEFT JOIN in SQL. You can do the same thing with Pandas Data Frames in Python.
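A minimal sketch of that idea with pandas, assuming the rows fit in DataFrames (the column names a0..a2 and b0..b2 are invented for the example):

import pandas as pd

rows_a = [['a', 'b', 'z'], ['b', 'e', 'f'], ['g', 'h', 'i']]
rows_b = [['a', 'b', 'z'], ['b', 'e', 'f'], ['g', 'h', 'i']]

df_a = pd.DataFrame(rows_a, columns=['a0', 'a1', 'a2'])
df_b = pd.DataFrame(rows_b, columns=['b0', 'b1', 'b2'])

# LEFT JOIN: keep every row of df_b and attach the df_a row whose
# first column equals df_b's second column (list_a[0] == list_b[1]).
# Non-matching rows of df_b survive with NaN in the df_a columns.
merged = df_b.merge(df_a, how='left', left_on='b1', right_on='a0')
print(merged)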

How can I split a list in two unique lists in Python?

Hi, I have a list as follows:
listt = ['a','b','c','d','e','f','g','h','i','j','k','l','m','n','o']
15 members.
I want to turn it into 3 lists. I used this code and it worked, but I want unique lists; this gives me 3 lists that can have mutual members.
import random
listt = ['a','b','c','d','e','f','g','h','i','j','k','l','m','n','o']
print(random.sample(listt,5))
print(random.sample(listt,5))
print(random.sample(listt,5))
Try this:
from random import shuffle

def randomise():
    listt = ['a','b','c','d','e','f','g','h','i','j','k','l','m','n','o']
    shuffle(listt)
    return listt[:5], listt[5:10], listt[10:]

print(randomise())
This will print (for example, since it is random):
(['i', 'k', 'c', 'b', 'a'], ['d', 'j', 'h', 'n', 'f'], ['e', 'l', 'o', 'g', 'm'])
If it doesn't matter to you which items go in each list, then you're better off partitioning the list into thirds:
In [23]: L = ['a','b','c','d','e','f','g','h','i','j','k','l','m','n','o']
In [24]: size = len(L)
In [25]: L[:size//3]
Out[25]: ['a', 'b', 'c', 'd', 'e']
In [26]: L[size//3:2*size//3]
Out[26]: ['f', 'g', 'h', 'i', 'j']
In [27]: L[2*size//3:]
Out[27]: ['k', 'l', 'm', 'n', 'o']
If you want them to have random elements from the original list, you'll just need to shuffle the input first:
random.shuffle(L)
Instead of sampling your list three times, which gives you three independent results where individual members may be selected for more than one list, you could just shuffle the list once and then split it into three parts. That way, you get three random subsets that will not share any items:
>>> random.shuffle(listt)
>>> listt[0:5]
['b', 'a', 'f', 'e', 'h']
>>> listt[5:10]
['c', 'm', 'g', 'j', 'o']
>>> listt[10:15]
['d', 'l', 'i', 'n', 'k']
Note that random.shuffle will shuffle the list in place, so the original list is modified. If you don’t want to modify the original list, you should make a copy first.
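For example, a minimal sketch of copying before shuffling:

import random

listt = ['a','b','c','d','e','f','g','h','i','j','k','l','m','n','o']
shuffled = listt[:]        # shallow copy, so the original stays intact
random.shuffle(shuffled)   # shuffles the copy in place
first, second, third = shuffled[:5], shuffled[5:10], shuffled[10:]
print(listt[:3])           # still ['a', 'b', 'c'] - original unmodified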
If your list is larger than the desired result set, then of course you can also sample your list once with the combined result size and then split the result accordingly:
>>> sample = random.sample(listt, 5 * 3)
>>> sample[0:5]
['h', 'm', 'i', 'k', 'd']
>>> sample[5:10]
['a', 'b', 'o', 'j', 'n']
>>> sample[10:15]
['c', 'l', 'f', 'e', 'g']
This solution will also avoid modifying the original list, so you will not need a copy if you want to keep it as it is.
Use [:] to slice all members out of the list, which copies everything into a new object. Alternatively, use list(<list>), which also makes a copy:
print(random.sample(listt[:],5))
In case you want to sample only once, store the sample result in a variable and copy it later:
output = random.sample(listt,5)
first = output[:]
second = output[:]
print(first is second, first is output) # False, False
and then the original list can be modified without the first or second being modified.
For nested lists you might want to use copy.deepcopy().
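A small sketch of the difference (the nested values are made up):

import copy

nested = [['a', 'b'], ['c', 'd']]
shallow = nested[:]            # the inner lists are still shared
deep = copy.deepcopy(nested)   # the inner lists are independent copies

nested[0].append('x')
print(shallow[0])   # ['a', 'b', 'x'] - the shallow copy sees the change
print(deep[0])      # ['a', 'b']      - the deep copy does not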

Is this correct use of flatten?

I am attempting to flatten a list using:
wd = ['this' , 'is']
np.asarray(list(map(lambda x : list(x) , wd))).flatten()
which returns:
array([['t', 'h', 'i', 's'], ['i', 's']], dtype=object)
when I'm expecting a char array: ['t','h','i','s','i','s']
Is this correct use of flatten?
No, this isn't a correct use for numpy.ndarray.flatten.
Two-dimensional NumPy arrays have to be rectangular, or they will be cast to object arrays (or an exception will be thrown). With object arrays, flatten won't work as intended (because it won't flatten the "objects"), and rectangular is impossible here because your words have different lengths.
When dealing with strings (or arrays of strings), NumPy won't flatten them at all, neither when you create the array nor when you try to "flatten" it:
>>> import numpy as np
>>> np.array(['fla', 'tten'])
array(['fla', 'tten'], dtype='<U4')
>>> np.array(['fla', 'tten']).flatten()
array(['fla', 'tten'], dtype='<U4')
Fortunately you can simply use "normal" Python features to flatten iterables, just to mention one example:
>>> wd = ['this' , 'is']
>>> [element for sequence in wd for element in sequence]
['t', 'h', 'i', 's', 'i', 's']
You might want to have a look at the following Q+A for more solutions and explanations:
Making a flat list out of list of lists in Python
Flatten (an irregular) list of lists
With just a list comprehension:
[u for i in np.asarray(list(map(lambda x : list(x) , wd))) for u in i]
gives you this:
['t', 'h', 'i', 's', 'i', 's']
Although, as the comments say, you can just use ''.join() for your specific example, this has the advantage of working for numpy arrays and lists of lists:
test = np.array(range(10)).reshape(2,-1)
[u for i in test for u in i]
returns a flat list:
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
In [8]: from itertools import chain
In [9]: list(chain.from_iterable(['this', 'is']))
Out[9]: ['t', 'h', 'i', 's', 'i', 's']

How to delete an object from a numpy array without knowing the index

Is it possible to delete an object from a numpy array without knowing the index of the object, but instead knowing the object itself?
I have seen that it is possible using the object's index with the np.delete function, but I'm looking for a way to do it knowing the object but not its index.
Example:
[a,b,c,d,e,f]
x = e
I would like to delete x.
You can find the index/indices of the object using np.argwhere, and then delete the object(s) using np.delete.
Example:
x = np.array([1,2,3,4,5])
index = np.argwhere(x==3)
y = np.delete(x, index)
print(x, y)
Cast it as a numpy array, and mask it out:
x = np.array(list("abcdef"))
x = x[x != 'e']  # <-- THIS IS THE METHOD
print(x)
# ['a' 'b' 'c' 'd' 'f']
Doesn't have to be more complicated than this.
Boolean indexing or masking is a good basic way of selecting or removing specific elements of an array.
You talk about removing a specific 'object'. Let's take that literally and define an array of dtype object:
In [2]: x=np.array(['a','b','c','d','e'],dtype=object)
In [3]: x
Out[3]: array(['a', 'b', 'c', 'd', 'e'], dtype=object)
In [4]: x=='d' # elements that equal 'd'
Out[4]: array([False, False, False, True, False], dtype=bool)
In [5]: x!='d' # elements that don't
Out[5]: array([ True, True, True, False, True], dtype=bool)
In [6]: x[x!='d'] # select a subset
Out[6]: array(['a', 'b', 'c', 'e'], dtype=object)
Behind the scenes, argwhere and delete use this. Note that argwhere uses the x=='d' boolean array, converting it to array indices, and constructing a mask like this is one way that delete operates.
There are some important limits:
the equality (or inequality) test has to work for your values. It might not if the elements are floats.
deleting from a 1d array is easier than from a 2d (or larger) one. With 2d you have to decide whether to delete a row, a column, or an element (and in the process flattening the array).
deleting only one of the elements that match is a bit trickier (a sketch follows below).
For some cases it might be better to .tolist() the array and use a list method.
In [32]: xl=x.tolist()
In [33]: xl.remove('d')
In [34]: np.array(xl,dtype=object)
Out[34]: array(['a', 'b', 'c', 'e'], dtype=object)
There's no exact equivalent to list.remove for arrays.
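As noted in the limits above, deleting only one of several matches takes an extra step; a minimal sketch using argwhere to grab just the first index:

import numpy as np

x = np.array(['a', 'b', 'd', 'c', 'd'], dtype=object)
first = np.argwhere(x == 'd')[0]   # index of the first match only
y = np.delete(x, first)
print(y)                           # ['a' 'b' 'c' 'd'] - only the first 'd' removed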
You could use np.setdiff1d(a, b); it returns all unique elements from a that are not in b. Note that the result is sorted and deduplicated.
>>> arr = np.array(['a', 'a', 'b', 'c', 'd', 'e', 'f'])
>>> to_remove = ['b', 'c']
>>> np.setdiff1d(arr, to_remove)
array(['a', 'd', 'e', 'f'], dtype='<U1')
arr = np.array(['a','b','c','d','e','f'])
Then
arr = [x for x in arr if x != 'e']
(Note that this comprehension returns a plain Python list rather than a numpy array; wrap it in np.array() if you need an array back.)

Iterate with last element repeated as first in next iteration

I have a list with objects that looks like:
oldList=[a,b,c,d,e,f,g,h,i,j,...]
What I need is to create a new list with nested list items, which will look like this:
newList=[[a,b,c,d],[d,e,f,g],[g,h,i,j]...]
or, simply put: the last element of each nested list is the first element of the next nested list.
One of the ways of doing it is
>>> l = ['a','b','c','d','e','f','g','h','i','j']
>>> [l[i:i+4] for i in range(0,len(l),3)]
[['a', 'b', 'c', 'd'], ['d', 'e', 'f', 'g'], ['g', 'h', 'i', 'j'], ['j']]
Here:
l[i:i+4] means we take a chunk of 4 values starting from position i
range(0,len(l),3) means we traverse the length of the list in jumps of 3
So the basic idea is that we step through the list 3 elements at a time, but make each slice one element longer so that it includes one additional element. In this way, each chunk of 4 shares its first element with the last element of the previous chunk.
Small note - the initialization oldList=[a,b,c,d,e,f,g,h,i,j,...] is invalid unless a, b, c, d, etc. are previously defined. You were perhaps looking for oldList = ['a','b','c','d','e','f','g','h','i','j'].
Alternatively, if you want a solution that splits into even-sized chunks only, you could try this:
>>> [l[i:i+4] for i in range(0,len(l)-len(l)%4,3)]
[['a', 'b', 'c', 'd'], ['d', 'e', 'f', 'g'], ['g', 'h', 'i', 'j']]
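Wrapping the same comprehension in a hypothetical helper makes the chunk size and step reusable:

def overlapping_chunks(seq, size=4, step=3):
    # Each chunk starts `step` items after the previous one but is
    # `size` items long, so consecutive chunks overlap by size - step.
    return [seq[i:i + size] for i in range(0, len(seq), step)]

l = ['a','b','c','d','e','f','g','h','i','j']
print(overlapping_chunks(l))
# [['a', 'b', 'c', 'd'], ['d', 'e', 'f', 'g'], ['g', 'h', 'i', 'j'], ['j']]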
