Remove duplicates in each list of a list of lists - python

I have a list of lists:
a = [[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0],
[2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 3.0, 3.0, 3.0, 3.0, 3.0, 4.0, 4.0, 4.0, 4.0],
[3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 5.0, 5.0, 5.0],
[1.0, 4.0, 4.0, 4.0, 5.0, 5.0, 5.0],
[5.0, 5.0, 5.0],
[1.0]
]
What I need to do is remove the duplicates within each inner list while keeping the original order of the elements, such as:
a = [[1.0],
[2.0, 3.0, 4.0],
[3.0, 5.0],
[1.0, 4.0, 5.0],
[5.0],
[1.0]
]

If order is important, you can just compare to the set of items seen so far:
a = [[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0],
[2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 3.0, 3.0, 3.0, 3.0, 3.0, 4.0, 4.0, 4.0, 4.0],
[3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 5.0, 5.0, 5.0],
[1.0, 4.0, 4.0, 4.0, 5.0, 5.0, 5.0],
[5.0, 5.0, 5.0],
[1.0]]
for index, lst in enumerate(a):
    seen = set()
    a[index] = [i for i in lst if i not in seen and seen.add(i) is None]
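Here i is added to seen as a side effect, relying on the short-circuit evaluation of Python's and operator: seen.add(i) is only called when the first check (i not in seen) evaluates to True. Since set.add always returns None, the "is None" test is always true and only serves to let the call sit inside the condition.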
Attribution: I saw this technique yesterday from @timgeb.
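If the side effect inside the comprehension feels too clever, the same idea can be written as a plain loop. This is only a sketch of the seen-set technique above; the function name dedupe_keep_order is my own, not something from the standard library.

```python
def dedupe_keep_order(items):
    """Return the items of one list with later duplicates dropped."""
    seen = set()
    result = []
    for item in items:
        if item not in seen:  # first time this value shows up
            seen.add(item)
            result.append(item)
    return result


a = [[1.0, 1.0, 1.0], [2.0, 2.0, 3.0, 3.0, 4.0], [1.0, 4.0, 4.0, 5.0]]
deduped = [dedupe_keep_order(lst) for lst in a]
print(deduped)  # [[1.0], [2.0, 3.0, 4.0], [1.0, 4.0, 5.0]]
```

Unlike the in-place loop, this builds a new outer list and leaves a untouched.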

If you have access to OrderedDict (Python 2.7 onward), abusing it is a good way to do this:
import collections
import pprint
a = [[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0],
[2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 3.0, 3.0, 3.0, 3.0, 3.0, 4.0, 4.0, 4.0, 4.0],
[3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 5.0, 5.0, 5.0],
[1.0, 4.0, 4.0, 4.0, 5.0, 5.0, 5.0],
[5.0, 5.0, 5.0],
[1.0]
]
b = [list(collections.OrderedDict.fromkeys(i)) for i in a]
pprint.pprint(b, width = 40)
Outputs:
[[1.0],
[2.0, 3.0, 4.0],
[3.0, 5.0],
[1.0, 4.0, 5.0],
[5.0],
[1.0]]
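On Python 3.7 and later, plain dicts keep insertion order as part of the language, so the same trick should work without collections. A minimal sketch, assuming you are on such a version (the shortened input is just for illustration):

```python
a = [[1.0, 1.0, 1.0], [2.0, 2.0, 3.0, 4.0, 4.0], [5.0, 1.0, 5.0]]

# dict.fromkeys keeps the first occurrence of each key, in insertion order
# (guaranteed from Python 3.7 onward).
b = [list(dict.fromkeys(inner)) for inner in a]
print(b)  # [[1.0], [2.0, 3.0, 4.0], [5.0, 1.0]]
```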

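This will help you, as long as each inner list is already in ascending order (as in your example): sorted(set(...)) returns the values in numeric order rather than in order of first appearance.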
a = [[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0],
[2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 3.0, 3.0, 3.0, 3.0, 3.0, 4.0, 4.0, 4.0, 4.0],
[3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 5.0, 5.0, 5.0],
[1.0, 4.0, 4.0, 4.0, 5.0, 5.0, 5.0],
[5.0, 5.0, 5.0],
[1.0]
]
for i in range(len(a)):
    a[i] = sorted(set(a[i]))
print(a)
OUTPUT:
[[1.0], [2.0, 3.0, 4.0], [3.0, 5.0], [1.0, 4.0, 5.0], [5.0], [1.0]]

Inspired by DOSHI, here's another way. It is probably the best approach when there are only a few distinct elements (i.e. a small number of index lookups for sorted); otherwise a method that remembers insertion order may be better:
b = [sorted(set(i), key=i.index) for i in a]
So just to compare the methods, a seen set versus sorting a set by an original index lookup:
>>> import timeit
>>> setup = 'l = [1,2,3,4,1,2,3,4,1,2,3,4]*100'
>>> timeit.repeat('sorted(set(l), key=l.index)', setup)
[23.231241687943111, 23.302754517266294, 23.29650511717773]
>>> timeit.repeat('seen = set(); [i for i in l if i not in seen and seen.add(i) is None]', setup)
[49.855933579601697, 50.171151882997947, 51.024657420945005]
Here we see that for a larger list, the containment test that Jon's method performs for every element becomes relatively costly. Since the first position of each value is found quickly by index in this case, sorting the set by index is much more efficient.
However, by appending more elements to the end of the list, we see that Jon's method does not bear much increased cost, whereas mine does:
>>> setup = 'l = [1,2,3,4,1,2,3,4,1,2,3,4]*100 + [8,7,6,5]'
>>> timeit.repeat('sorted(set(l), key=l.index)', setup)
[93.221347206941573, 93.013769266020972, 92.64512197257136]
>>> timeit.repeat('seen = set(); [i for i in l if i not in seen and seen.add(i) is None]', setup)
[51.042504915545578, 51.059295348750311, 50.979311841569142]
I think I'd prefer Jon's method with a seen set, given the bad lookup times for the index.
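To rerun the comparison on your own machine, something like the script below should do. It is a sketch only: the function names by_index and by_seen and the repeat counts are my own choices, and the timings will differ from the numbers above.

```python
import timeit


def by_index(l):
    # Sort the unique values by first position; l.index is a linear scan.
    return sorted(set(l), key=l.index)


def by_seen(l):
    # Single pass with a set of values already emitted.
    seen = set()
    return [i for i in l if i not in seen and seen.add(i) is None]


short_tail = [1, 2, 3, 4] * 300          # same data as the first setup
long_tail = short_tail + [8, 7, 6, 5]    # new values only at the very end

for name, data in (("repeats only", short_tail), ("late values", long_tail)):
    assert by_index(data) == by_seen(data)  # both give the same answer
    for func in (by_index, by_seen):
        seconds = min(timeit.repeat(lambda: func(data), number=200, repeat=3))
        print(f"{name:>12} {func.__name__:>8}: {seconds:.4f}s")
```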

Related

Python - Reordering the final list in a list of lists and having all corresponding list indices change to the same index ordering

I have a list of lists in python i.e.
[[6.0, 3.0, 16.0, 3.0], [3.0, 2.0, 5.0, 7.0], [4.0, 3.0, 2.0, 1.0]]
I then want to sort the final list in the list of lists in ascending order, and have the same reordering of indexes applied to the other lists within the list. For example,
[[6.0, 3.0, 16.0, 3.0], [3.0, 2.0, 5.0, 7.0], [4.0, 3.0, 2.0, 1.0]]
turns into
[[3.0, 16.0, 3.0, 6.0], [7.0, 5.0, 2.0, 3.0], [1.0, 2.0, 3.0, 4.0]]
Apologies if this isn't worded greatly, I am rather new to python.
I have looked into using the zip and sorted functions however haven't been able to use them to the effect I want to.
One way to do this is to pair each number in the list you are ordering with its index, sort those pairs by value, and then use the resulting order of indexes to pick the elements of every list.
def order_by_last(data):
    indexes = list(enumerate(data[-1]))
    indexes.sort(key=lambda pair: pair[1])
    new_list = [[sublist[index[0]] for index in indexes] for sublist in data]
    return new_list
In [56]: order_by_last([[6.0, 3.0, 16.0, 3.0], [3.0, 2.0, 5.0, 7.0], [4.0, 3.0, 2.0, 1.0]])
Out[56]: [[3.0, 16.0, 3.0, 6.0], [7.0, 5.0, 2.0, 3.0], [1.0, 2.0, 3.0, 4.0]]
I'm not sure if you are willing to use external libraries, but what you need here is an argsort, which NumPy provides as numpy.argsort. Note that its output is not a Python list but a NumPy array (which can be converted, though).
So you can get your result by doing the following:
from numpy import argsort

# assumes list_of_lists has already been defined
list_order = argsort(list_of_lists[-1])
new_list = []
for single_list in list_of_lists:
    buffer_list = []
    for position in list_order:
        buffer_list.append(single_list[position])
    new_list.append(buffer_list)
Keep in mind though that if your lists are different sizes, this might break.
Create a sorted list of indexes based on the last list, then recreate each other list based on these indexes.
l = [[6.0, 3.0, 16.0, 3.0], [3.0, 2.0, 5.0, 7.0], [4.0, 3.0, 2.0, 1.0]]
indexes = sorted(range(len(l[-1])), key=lambda x:l[-1][x])
res = [[x[i] for i in indexes] for x in l]
One option is to use zip to restructure the list into columnwise tuples, sort them and then turn that back into original lists:
L = [[6.0, 3.0, 16.0, 3.0], [3.0, 2.0, 5.0, 7.0], [4.0, 3.0, 2.0, 1.0]]
R = [*map(list,zip(*sorted(zip(*L[::-1]))))][::-1]
# [[3.0, 16.0, 3.0, 6.0], [7.0, 5.0, 2.0, 3.0], [1.0, 2.0, 3.0, 4.0]]
Another way (much less efficient but perhaps more readable) is to sort each row based on the last row's corresponding values:
R = [ [v for _,v in sorted(zip(L[-1],r))] for r in L ]
>>> a = [[6.0, 3.0, 16.0, 3.0], [3.0, 2.0, 5.0, 7.0], [4.0, 3.0, 2.0, 1.0]]
>>> list(zip(*sorted((zip(*a)), key=lambda x: x[-1])))
[(3.0, 16.0, 3.0, 6.0), (7.0, 5.0, 2.0, 3.0), (1.0, 2.0, 3.0, 4.0)]
I'm using two idioms here:
zip(*list_of_lists) acts as a matrix transposer by swapping rows and columns of the matrix, represented by a list of lists.
sorting the transposed list of lists by the value of the last element.
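Spelled out step by step, and converting the tuples back to lists, the two idioms might look like this. The function name sort_columns_by_last_row is just for illustration, and the sketch assumes all inner lists have the same length.

```python
def sort_columns_by_last_row(rows):
    """Reorder the columns of a list of equal-length lists by the last row."""
    columns = zip(*rows)                                # transpose: rows -> columns
    ordered = sorted(columns, key=lambda col: col[-1])  # sort columns by last entry
    return [list(row) for row in zip(*ordered)]         # transpose back to rows


a = [[6.0, 3.0, 16.0, 3.0], [3.0, 2.0, 5.0, 7.0], [4.0, 3.0, 2.0, 1.0]]
print(sort_columns_by_last_row(a))
# [[3.0, 16.0, 3.0, 6.0], [7.0, 5.0, 2.0, 3.0], [1.0, 2.0, 3.0, 4.0]]
```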

How to train an LSTM with sequences of numbers of different lengths?

I am trying to train an LSTM with a dataset in which both the input and the output are sequences of numbers of different lengths. Each number in the input represents a timestep. Example of input and output:
Input:
ent
229 [3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 2.0, 2.0, ...
511 [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 3.0, 3.0, ...
110 [2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 4.0, 4.0, ...
243 [2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 3.0, 3.0, ...
334 [3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 2.0, 2.0, ...
Output:
sal
229 [6.0, 7.0, 3.0, 0.0, 1.0, 4.0, 5.0, 2.0]
511 [0.0, 1.0, 6.0, 7.0, 2.0, 4.0, 5.0, 6.0, 7.0]
110 [3.0, 5.0, 0.0, 1.0, 5.0, 6.0, 7.0, 3.0]
243 [3.0, 6.0, 7.0, 4.0, 6.0, 7.0, 0.0, 1.0, 4.0]
334 [6.0, 7.0, 3.0, 4.0, 3.0, 5.0, 4.0]
When I train the model, this error always appears:
ValueError: Failed to convert a NumPy array to a Tensor (Unsupported object type numpy.ndarray).
model = keras.Sequential()
model.add(layers.Input(shape=(None, 200)))
model.add(layers.LSTM(20))
Should I select a different NN or include padding?
I have also tried changing the dimension to:
ent
229 [[3.0], [3.0], [3.0], [3.0], [3.0], [3.0], [3....
Do you know what I could do?
Traceback (most recent call last):
at block 8, line 8
at /opt/python/envs/default/lib/python3.8/site-packages/keras/utils/traceback_utils.py, line 67, in error_handler(*args, **kwargs)
at /opt/python/envs/default/lib/python3.8/site-packages/tensorflow/python/framework/constant_op.py, line 106, in convert_to_eager_tensor(value, ctx, dtype)
ValueError: Failed to convert a NumPy array to a Tensor (Unsupported object type numpy.ndarray).

I can't properly shape a 2d array for a classifier

I have an issue with the numpy.array method. I'm trying to set up an array of dimensions (73, 125) with my data, but when applying the .array method I get something like this
set arousal (73,) [list([3.0, 4.0, 4.0, 3.0, 5.0, 3.0, 2.0, 4.0, 2.0, 3.0, 3.0, 3.0, 3.0, 3.0, 4.0, 2.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 2.0, 2.0, 3.0, 4.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 4.0, 5.0, 3.0, 3.0, 1.0, 3.0, 3.0, 3.0, 3.0, 3.0, 4.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 2.0, 2.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 4.0, 4.0, 3.0, 3.0, 3.0, 3.0, 3.0, 5.0, 4.0, 4.0, 5.0, 3.0, 3.0, 3.0, 5.0, 2.0, 3.0, 2.0, 4.0, 3.0, 2.0, 3.0, 2.0, 3.0, 2.0, 2.0, 4.0, 3.0, 4.0, 5.0, 4.0, 3.0, 4.0, 4.0, 4.0, 3.0, 5.0, 3.0, 5.0, 2.0, 3.0, 3.0, 2.0, 3.0, 3.0, 3.0, 3.0, 4.0, 5.0, 5.0, 4.0, 2.0, 3.0, 3.0, 3.0, 3.0, 3.0, 2.0, 1.0, 2.0]) # etc...
While instead I was expecting something like set arousal (73, 125).
This is my code
# Before this I imported the packages, the relevant datasets and did some preprocessing to drop "bad" data
info_en = info_clean[info_clean['QESTN_LANGUAGE'] == 'ENG']
rating_en = rating_clean[rating_clean['LANGUAGE'] == 'ENG']
info_en_set = info_en.copy()
ratings_set = rating_en.copy()
lArousal = []
lValence = []
for case in case_list:
    # renamed from "set" so the built-in set type is not shadowed
    case_rows = ratings_set[ratings_set['CASE'] == case]
    lArousal.append(list(case_rows.loc[:,['AROUSAL_RATING']]['AROUSAL_RATING']))
    lValence.append(list(case_rows.loc[:,['VALENCE_RATING_RECODED']]['VALENCE_RATING_RECODED']))
arrArousal = np.asarray(lArousal)
arrValence = np.asarray(lValence)
print('set arousal',arrArousal.shape,arrArousal)
print('set valence',arrValence.shape,arrValence)
When I try to train my sklearn classifier I get the error message "setting an array element with a sequence." that I can understand but I can't solve the list issue.
Apparently, the for loop works for one dataset that I am testing but not for the other. In one case I correctly get the 2d array, in the other, I am stuck with this array of lists.
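Going by the printed shape (73,) and the list(...) elements, one likely cause is that the inner lists do not all have the same length; NumPy can only build a (73, 125) array when every case contributes exactly 125 ratings. A quick way to check is sketched below. The ratings data here is a made-up stand-in for lArousal, and length_report is a hypothetical helper name.

```python
from collections import Counter


def length_report(rows):
    """Count how many inner lists have each length."""
    return Counter(len(row) for row in rows)


# Hypothetical stand-in for lArousal: the last case has one rating fewer.
ratings = [[3.0, 4.0, 4.0], [2.0, 3.0, 3.0], [5.0, 1.0]]

report = length_report(ratings)
print(report)  # Counter({3: 2, 2: 1})

# Indexes of the cases whose length differs from the most common one.
expected_len = report.most_common(1)[0][0]
odd_cases = [i for i, row in enumerate(ratings) if len(row) != expected_len]
print(odd_cases)  # [2]
```

If the report shows more than one length, those cases would need to be dropped, padded, or truncated before calling np.asarray.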

Python dictionary iteratively expanded into lists

I have the following Python dictionary:
b = {'SP:1': 1.0,
'SP:2': 2.0,
'SP:3': 3.0,
'SP:4': 4.0,
'SP:5': 5.0,
'SP:6': 6.0,
'SP:7': 40.0,
'SP:8': 7.0,
'SP:9': 8.0}
I want to take this dictionary and iterate over it to create 9 lists, each successive list being a superset of its predecessor. So:
[1.0]
[1.0,2.0]
[1.0,2.0,3.0]
[1.0,2.0,3.0,4.0]
...
[1.0,2.0,3.0,4.0,5.0,6.0,40.0,7.0,8.0]
There is probably a really easy way of doing this with a list comprehension, but I can't work it out!
You can do the following:
>>> vals = [v for k, v in sorted(b.items())]
# or shorter, but less explicit:
# vals = [b[k] for k in sorted(b)]
>>> [vals[:i+1] for i in range(len(vals))]
[[1.0],
[1.0, 2.0],
[1.0, 2.0, 3.0],
[1.0, 2.0, 3.0, 4.0],
[1.0, 2.0, 3.0, 4.0, 5.0],
[1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
[1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 40.0],
[1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 40.0, 7.0],
[1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 40.0, 7.0, 8.0]]
The first comprehension gives you the values sorted by key as the initial dict is inherently unordered. The second gives you all of the desired slices of that list of values.
Dictionaries are not meant to be used in this form, and should never be considered ordered. However, since the keys are basically indices, we can use them like that:
[[b['SP:'+str(j+1)] for j in range(i+1)] for i in range(len(b))]
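One caveat with sorting the keys as strings: if the keys ever go past 'SP:9', plain sorted puts 'SP:10' before 'SP:2', because strings compare character by character. A sketch of sorting by the numeric part instead, assuming every key has the form 'SP:<integer>' (the key_number helper and the extra 'SP:10' entry are made up for the example):

```python
b = {'SP:1': 1.0, 'SP:2': 2.0, 'SP:3': 3.0, 'SP:10': 10.0}


def key_number(key):
    # 'SP:10' -> 10; assumes every key looks like 'SP:<integer>'
    return int(key.split(':')[1])


vals = [b[k] for k in sorted(b, key=key_number)]
prefixes = [vals[:i + 1] for i in range(len(vals))]
print(prefixes)
# [[1.0], [1.0, 2.0], [1.0, 2.0, 3.0], [1.0, 2.0, 3.0, 10.0]]
```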

Tensorflow large tensor split to small tensor

I have a tensor like below
x = tf.Variable(tf.truncated_normal([batch, input], stddev=0.1))
Assume that batch = 99 and input = 5, and that I would like to split x into smaller tensors.
If x is below:
[[1.0, 2.0, 3.0, 4.0, 5.0]
[2.0, 3.0, 4.0, 5.0, 6.0]
[3.0, 4.0, 5.0, 6.0, 7.0]
[4.0, 5.0, 6.0, 7.0, 8.0]
.........................
.........................
.........................
[44.0, 55.0, 66.0, 77.0, 88.0]
[55.0, 66.0, 77.0, 88.0, 99.0]]
I want to split up into two tensors
[[1.0, 2.0, 3.0, 4.0, 5.0]
[2.0, 3.0, 4.0, 5.0, 6.0]
[3.0, 4.0, 5.0, 6.0, 7.0]]
and
[[4.0, 5.0, 6.0, 7.0, 8.0]
.........................
.........................
[44.0, 55.0, 66.0, 77.0, 88.0]
[55.0, 66.0, 77.0, 88.0, 99.0]]
I don't know how to use tf.split to split by rows.
An expedient way would be to call tf.slice twice.
