I have two equal-length lists, a and b:
a = [1, 1, 2, 4, 5, 5, 5, 6, 1]
b = ['a','b','c','d','e','f','g','h', 'i']
I would like to keep only those elements of b whose corresponding element in a appears for the first time. Expected result:
result = ['a', 'c', 'd', 'e', 'h']
One way of reaching this result:
result = [each for index, each in enumerate(b) if a[index] not in a[:index]]
# result will be ['a', 'c', 'd', 'e', 'h']
Another way, invoking Pandas:
import pandas as pd
df = pd.DataFrame(dict(a=a,b=b))
result = list(df.b[~df.a.duplicated()])
# result will be ['a', 'c', 'd', 'e', 'h']
Is there a more efficient way of doing this for large a and b?
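You could check whether this is faster. It makes a single pass and uses a dict lookup instead of scanning a[:index] for every element: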
firsts = {}
result = [firsts.setdefault(x, y) for x, y in zip(a, b) if x not in firsts]
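A sketch of the same single-pass idea written as an explicit loop over a set of already-seen values, which may be easier to read than the setdefault comprehension. The function name first_occurrences is made up for illustration, and the sketch assumes the elements of a are hashable.

```python
def first_occurrences(a, b):
    """Return the elements of b whose partner in a is seen for the first time."""
    seen = set()
    result = []
    for x, y in zip(a, b):
        if x not in seen:  # average O(1) membership test on a set
            seen.add(x)
            result.append(y)
    return result

a = [1, 1, 2, 4, 5, 5, 5, 6, 1]
b = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']
print(first_occurrences(a, b))  # ['a', 'c', 'd', 'e', 'h']
```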
Given a list of strings,
['a', 'a', 'c', 'a', 'a', 'a', 'd', 'c', 'd', 'd', 'd', 'd', 'c', 'd', 'd', 'd', 'd', 'c', 'd', 'd', 'd', 'd', 'c', 'b', 'b', 'b', 'd', 'b', 'b', 'b']
I would like to convert it to an integer-category form:
[0, 0, 2, 0, 0, 0, 3, 2, 3, 3, 3, 3, 2, 3, 3, 3, 3, 2, 3, 3, 3, 3, 2, 1, 1, 1, 3, 1, 1, 1]
This can be achieved using numpy.unique, as below:
import numpy as np
ipt=['a', 'a', 'c', 'a', 'a', 'a', 'd', 'c', 'd', 'd', 'd', 'd', 'c', 'd', 'd', 'd', 'd', 'c', 'd', 'd', 'd', 'd', 'c', 'b', 'b', 'b', 'd', 'b', 'b', 'b']
_, opt = np.unique(np.array(ipt), return_inverse=True)
But I am curious whether there is an alternative that does not need to import numpy.
If you are solely interested in finding an integer representation of the factors, you can use a dict comprehension along with enumerate to store the mapping, after using set to find the unique values:
lst = ['a', 'a', 'c', 'a', 'a', 'a', 'd', 'c', 'd', 'd', 'd', 'd', 'c', 'd', 'd', 'd', 'd', 'c', 'd', 'd', 'd', 'd', 'c', 'b', 'b', 'b', 'd', 'b', 'b', 'b']
d = {x: i for i, x in enumerate(set(lst))}
lst_new = [d[x] for x in lst]
print(lst_new)
# Possible output (set iteration order is arbitrary, so the integers can differ between runs):
# [3, 3, 0, 3, 3, 3, 2, 0, 2, 2, 2, 2, 0, 2, 2, 2, 2, 0, 2, 2, 2, 2, 0, 1, 1, 1, 2, 1, 1, 1]
This approach can be used for general factors, i.e., the factors do not have to be 'a', 'b' and so on, but can be 'dog', 'bus', etc. One drawback is that the integers assigned are arbitrary, because a set has no defined order. If you want the integers to follow the sorted order of the factors, you can use sorted:
d = {x: i for i, x in enumerate(sorted(set(lst)))}
lst_new = [d[x] for x in lst]
print(lst_new)
# [0, 0, 2, 0, 0, 0, 3, 2, 3, 3, 3, 3, 2, 3, 3, 3, 3, 2, 3, 3, 3, 3, 2, 1, 1, 1, 3, 1, 1, 1]
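If the integers should instead follow the order in which each factor first appears, one option is to deduplicate with dict.fromkeys rather than set. This is a minimal sketch on a shortened list, assuming Python 3.7 or later, where dicts keep insertion order.

```python
lst = ['a', 'a', 'c', 'a', 'd', 'c', 'b', 'd']

# dict.fromkeys keeps one key per distinct value, in order of first appearance.
d = {x: i for i, x in enumerate(dict.fromkeys(lst))}
lst_new = [d[x] for x in lst]
print(lst_new)  # [0, 0, 1, 0, 2, 1, 3, 2]
```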
You could take a page out of the functional programming book:
ipt=['a', 'a', 'c', 'a', 'a', 'a', 'd', 'c', 'd', 'd', 'd', 'd', 'c', 'd', 'd', 'd', 'd', 'c', 'd', 'd', 'd', 'd', 'c', 'b', 'b', 'b', 'd', 'b', 'b', 'b']
opt = list(map(lambda x: ord(x)-97, ipt))
This code iterates through the input list and passes each element through the lambda function, which takes the ASCII value of the character and subtracts 97, the value of 'a', to map the lowercase letters to 0-25.
If each string isn't a single lowercase character, the lambda function will need to be adapted.
You could write a custom function to do the same thing as you are using numpy.unique() for.
def unique(my_list):
    ''' Takes a list and returns two lists, a list of each unique entry and the index of
    each unique entry in the original list
    '''
    unique_list = []
    int_cat = []
    for item in my_list:
        if item not in unique_list:
            unique_list.append(item)
        int_cat.append(unique_list.index(item))
    return unique_list, int_cat
Or, if you want the indexing to follow sorted order:
def unique_ordered(my_list):
    ''' Takes a list and returns two lists, an ordered list of each unique entry and the
    index of each unique entry in the original list
    '''
    # Unique list
    unique_list = []
    for item in my_list:
        if item not in unique_list:
            unique_list.append(item)
    # Sorting unique list alphabetically
    unique_list.sort()
    # Integer category list
    int_cat = []
    for item in my_list:
        int_cat.append(unique_list.index(item))
    return unique_list, int_cat
Comparing the computation time for these two vs numpy.unique() for 100,000 iterations of your example list, we get:
numpy = 2.236004s
unique = 0.460719s
unique_ordered = 0.505591s
This shows that either option is faster than numpy for short, simple lists. Longer strings and a larger variety of values slow down unique() and unique_ordered() much more than numpy.unique(). Doing 10,000 iterations on a random 100-element list of 20-character strings, we get times of:
numpy = 0.45465s
unique = 1.56963s
unique_ordered = 1.59445s
So if efficiency is important and your list has more complex strings or a larger variety of them, it would likely be better to use numpy.unique().
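Part of that slowdown plausibly comes from `item not in unique_list` and `unique_list.index(item)`, which both scan the list. Below is a minimal sketch of a variant that looks positions up in a dict instead. The name unique_dict is hypothetical, the items are assumed to be hashable, and it has not been timed against the figures above.

```python
def unique_dict(my_list):
    """Like unique(), but finds positions through a dict instead of scanning a list."""
    position = {}  # item -> integer code, assigned in order of first appearance
    int_cat = []
    for item in my_list:
        # setdefault stores len(position) only the first time an item is seen
        int_cat.append(position.setdefault(item, len(position)))
    return list(position), int_cat

print(unique_dict(['a', 'a', 'c', 'a', 'd', 'c', 'b']))
# (['a', 'c', 'd', 'b'], [0, 0, 1, 0, 2, 1, 3])
```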
I have seen similar questions to mine, but nothing I researched really fixed my issue.
So, basically I want to split a list, in order to remove some items and concatenate it back. Those items correspond to indexes that are given by a list of tuples.
import numpy as np
arr = ['x','y','z','a','b','c','d','e','f','g',2,3,4]
indices = [(2,4),(7,9)] #INDEXES THAT NEED TO BE CUT OUT
print([arr[0:s] + arr[s+1:e] for s, e in indices])
#Returns: [['x', 'y', 'a'], ['x', 'y', 'z', 'a', 'b', 'c', 'd', 'f']]
This code, which I got from one of the answers to this post, nearly does what I need. I tried to adapt it to loop over indices once, but instead it runs twice and it doesn't include the rest of the list.
I want my final list to run from index zero to the first item of the first tuple, and so on, using a for loop or some iterator.
Something like this,
final_arr = arr[0:indices[0][0]] + arr[indices[0][1]:indices[1][0]] + arr[indices[1][1]:]
#Returns: [['x','y','a','b','c','f','g',2,3,4]]
If someone could do it using for loops, it would be easier for me to see how you understand the problem; afterwards I can try to adapt it to shorter code.
Sort the indices using sorted and del the slices. You need reverse=True otherwise the indices of the later slices are incorrect.
for x, y in sorted(indices, reverse=True):
    del arr[x:y]
print(arr)
>>> ['x', 'y', 'b', 'c', 'd', 'g', 2, 3, 4]
This is the same result as you get with
print(arr[0:indices[0][0]] + arr[indices[0][1]:indices[1][0]] + arr[indices[1][1]:])
>>> ['x', 'y', 'b', 'c', 'd', 'g', 2, 3, 4]
arr = ['x','y','z','a','b','c','d','e','f','g',2,3,4]
indices = [(2,4),(7,9)] #INDEXES THAT NEED TO BE CUT OUT
import itertools
ignore = set(itertools.chain.from_iterable(map(lambda i: range(*i), indices)))
out = [c for idx, c in enumerate(arr) if idx not in ignore]
print(out)
print(arr[0:indices[0][0]] + arr[indices[0][1]:indices[1][0]] + arr[indices[1][1]:])
Output,
['x', 'y', 'b', 'c', 'd', 'g', 2, 3, 4]
['x', 'y', 'b', 'c', 'd', 'g', 2, 3, 4]
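Since the question asks for a for loop, here is a sketch that builds a new list without modifying arr. It assumes the tuples are half-open (start, end) ranges that do not overlap. The variable start is introduced only for this illustration.

```python
arr = ['x', 'y', 'z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 2, 3, 4]
indices = [(2, 4), (7, 9)]  # half-open (start, end) ranges to cut out

final_arr = []
start = 0  # where the next kept stretch begins
for s, e in sorted(indices):
    final_arr += arr[start:s]  # keep everything up to the cut
    start = e                  # resume right after the cut
final_arr += arr[start:]       # keep the tail after the last cut

print(final_arr)  # ['x', 'y', 'b', 'c', 'd', 'g', 2, 3, 4]
```

Because each kept stretch is copied with a slice, the loop does not need the reverse-order trick that in-place deletion requires.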
Like this:
arr = ['x','y','z','a','b','c','d','e','f','g',2,3,4]
indices = [(2,4),(7,9)] #INDEXES THAT NEED TO BE CUT OUT
# Keep a value only if its position falls inside none of the ranges
print([v for i, v in enumerate(arr) if not any(s <= i < e for s, e in indices)])
Output:
['x', 'y', 'b', 'c', 'd', 'g', 2, 3, 4]
1- If you can remove the list items:
I am using JimithyPicker's example. I changed the index list to the positions of the items to remove, adjusted because every time an item is removed the list gets shorter and the later indices shift.
arr = ['x','y','z','a','b','c','d','e','f','g',2,3,4]
indices = [2,5,5] #INDEXES THAT NEED TO BE CUT OUT
for index in indices:
    arr.pop(index)
final_arr = [arr]
print(final_arr)
Output:
[['x', 'y', 'a', 'b', 'c', 'f', 'g', 2, 3, 4]]
2- If you can't remove items:
In this case it is necessary to change the second index, because the numbers don't match the output that you want.
With indices = [(2,4),(7,9)] the output is: ['x', 'y', 'a', 'b', 'c', 'd', 'f', 'g', 2, 3, 4]
arr = ['x','y','z','a','b','c','d','e','f','g',2,3,4]
indices = [(2,4),(6,9)] #INDEXES THAT NEED TO BE CUT OUT
final_arr = arr[0:indices[0][0]] + arr[indices[0][1]-1:indices[1][0]] + arr[indices[1][1]-1:]
print(final_arr)
Output:
['x','y','a','b','c','f','g',2,3,4]
In a Python list, how can I map all instances of one value to another value?
For example, suppose I have this list:
x = [1, 3, 3, 2, 3, 1, 2]
Now, perhaps I want to change all 1's to 'a', all 2's to 'b', and all 3's to 'c', to create another list:
y = ['a', 'c', 'c', 'b', 'c', 'a', 'b']
How can I do this mapping elegantly?
You should use a dictionary and a list comprehension:
>>> x = [1, 3, 3, 2, 3, 1, 2]
>>> d = {1: 'a', 2: 'b', 3: 'c'}
>>> [d[i] for i in x]
['a', 'c', 'c', 'b', 'c', 'a', 'b']
>>>
>>> x = [True, False, True, True, False]
>>> d = {True: 'a', False: 'b'}
>>> [d[i] for i in x]
['a', 'b', 'a', 'a', 'b']
>>>
The dictionary serves as a translation table of what gets converted into what.
An alternative solution is to use the built-in function map which applies a function to a list:
>>> x = [1, 3, 3, 2, 3, 1, 2]
>>> subs = {1: 'a', 2: 'b', 3: 'c'}
>>> list(map(subs.get, x)) # list() not needed in Python 2
['a', 'c', 'c', 'b', 'c', 'a', 'b']
Here the dict.get method was applied to each element of the list x, and each number was exchanged for its corresponding letter in subs.
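One difference between the two approaches shows up when a value has no entry in the dictionary: d[i] raises KeyError, while subs.get(i) returns None. A small sketch, assuming unmapped values should pass through unchanged, which is a choice made for this example rather than something the question specifies:

```python
x = [1, 3, 4, 2]
subs = {1: 'a', 2: 'b', 3: 'c'}

# subs.get(i) alone would give None for 4; passing i as the default keeps the original value.
y = [subs.get(i, i) for i in x]
print(y)  # ['a', 'c', 4, 'b']
```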
In [255]: x = [1, 3, 3, 2, 3, 1, 2]
In [256]: y = ['a', 'c', 'c', 'b', 'c', 'a', 'b']
In [257]: [dict(zip(x,y))[i] for i in x]
Out[257]: ['a', 'c', 'c', 'b', 'c', 'a', 'b']
Here is an input list:
['a', 'b', 'b', 'c', 'c', 'd']
The output I expect should be:
[[0, 'a'], [1, 'b'], [1, 'b'], [2, 'c'], [2, 'c'], [3, 'd']]
I tried to use map():
>>> map(lambda (index, word): [index, word], enumerate(['a', 'b', 'b', 'c', 'c', 'd']))
[[0, 'a'], [1, 'b'], [2, 'b'], [3, 'c'], [4, 'c'], [5, 'd']]
How can I get the expected result?
EDIT: This is not a sorted list; the index should increase only when a new element is encountered.
>>> import itertools
>>> seq = ['a', 'b', 'b', 'c', 'c', 'd']
>>> [[i, c] for i, (k, g) in enumerate(itertools.groupby(seq)) for c in g]
[[0, 'a'], [1, 'b'], [1, 'b'], [2, 'c'], [2, 'c'], [3, 'd']]
import itertools
[
    [i, x]
    for i, (value, group) in enumerate(itertools.groupby(['a', 'b', 'b', 'c', 'c', 'd']))
    for x in group
]
It sounds like you want to rank the terms based on a lexicographical ordering.
input = ['a', 'b', 'b', 'c', 'c', 'd']
mapping = { v:i for (i, v) in enumerate(sorted(set(input))) }
[ [mapping[v], v] for v in input ]
Note that this works for unsorted inputs as well.
If, as your amendment suggests, you want to number items based on order of first appearance, a different approach is in order. The following is short and sweet, albeit offensively hacky:
[ [d.setdefault(v, len(d)), v] for d in [{}] for v in input ]
When the list is sorted, use groupby (see jamylak's answer); when it is not, just iterate over the list and check whether you have already seen each letter:
a = ['a', 'b', 'b', 'c', 'c', 'd']
result = []
d = {}
n = 0
for k in a:
    if k not in d:
        d[k] = n
        n += 1
    result.append([d[k], k])
This is the most efficient solution; it takes only O(n) time.
Example output for an unsorted list, here the list above with an extra 'a' appended:
[[0, 'a'], [1, 'b'], [1, 'b'], [2, 'c'], [2, 'c'], [3, 'd'], [0, 'a']]
As you can see, the items are in the same order as in the input list.
If you sort the list first, you need O(n*log(n)) additional time.
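To see why groupby only fits sorted (or already grouped) input, the sketch below runs both approaches on the same unsorted list. The names seq, by_group and by_first_seen are made up for this comparison.

```python
import itertools

seq = ['a', 'b', 'b', 'c', 'c', 'd', 'a']  # unsorted: 'a' comes back at the end

# groupby starts a new group at every change of value, so the last 'a' gets a fresh index.
by_group = [[i, c] for i, (_, g) in enumerate(itertools.groupby(seq)) for c in g]

# A dict remembers the index given to each value the first time it was seen.
seen = {}
by_first_seen = [[seen.setdefault(c, len(seen)), c] for c in seq]

print(by_group[-1])       # [4, 'a']
print(by_first_seen[-1])  # [0, 'a']
```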