Look up value in an array - python

Suppose I have two datasets
DS1
ArrayCol
[1,2,3,4]
[1,2,3]
DS2
Key Name
1 A
2 B
3 C
4 D
How can I look up the values in each array and map them to "Name", so that I get another dataset like the following?
DS3
COlNew
[A,B,C,D]
[A,B,C]
Thanks. It's in Databricks, so any method is OK: Python, SQL, Scala…

You can try this:
ds1 = [[1, 2, 3, 4], [1, 2, 3]]
ds2 = {1: 'A', 2: 'B', 3: 'C', 4: 'D'}
new_data = [[ds2[cell] for cell in col] for col in ds1]
print(new_data)
output:
[['A', 'B', 'C', 'D'], ['A', 'B', 'C']]
Hope that helps. :)
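If a key in ds1 might be missing from ds2, the direct ds2[cell] lookup raises KeyError. A minimal sketch, assuming a placeholder value is acceptable for unknown keys (None is my choice here, not something the question specifies):

```python
ds1 = [[1, 2, 3, 4], [1, 2, 5]]          # 5 is deliberately absent from ds2
ds2 = {1: 'A', 2: 'B', 3: 'C', 4: 'D'}

# dict.get returns None for a missing key instead of raising KeyError
new_data = [[ds2.get(cell) for cell in col] for col in ds1]
print(new_data)
```

Passing a second argument, such as ds2.get(cell, '?'), would substitute a different placeholder.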

Let's assume your datasets are in files; then you can do something like this,
making use of a dict:
f = open("ds1.txt").readlines()
g = open("ds2.txt").readlines()
# build the lookup table from the tab-separated key/name lines
u = dict(item.rstrip().split("\t") for item in g)
for i in f:
    # turn "[1,2,3]" into ['1', '2', '3']
    i = i.rstrip().strip('][').split(',')
    print([u[col] for col in i])
Output
['A', 'B', 'C', 'D']
['A', 'B', 'C']
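The same file-based idea can be tried without real files. This is only a sketch: it assumes ds1 holds one bracketed list per line and ds2 holds tab-separated key/name pairs with no header row, and it uses io.StringIO as a stand-in for the two files.

```python
import io

# In-memory stand-ins for ds1.txt and ds2.txt (assumed layouts, see above)
ds1_file = io.StringIO("[1,2,3,4]\n[1,2,3]\n")
ds2_file = io.StringIO("1\tA\n2\tB\n3\tC\n4\tD\n")

# Keys stay strings here because they are read from text
lookup = dict(line.rstrip("\n").split("\t") for line in ds2_file)

rows = []
for line in ds1_file:
    keys = [k.strip() for k in line.strip().strip("[]").split(",")]
    rows.append([lookup[k] for k in keys])
print(rows)
```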

Related

Filter list using duplicates in another list

I have two equal-length lists, a and b:
a = [1, 1, 2, 4, 5, 5, 5, 6, 1]
b = ['a','b','c','d','e','f','g','h', 'i']
I would like to keep only those elements from b, which correspond to an element in a appearing for the first time. Expected result:
result = ['a', 'c', 'd', 'e', 'h']
One way of reaching this result:
result = [each for index, each in enumerate(b) if a[index] not in a[:index]]
# result will be ['a', 'c', 'd', 'e', 'h']
Another way, invoking Pandas:
import pandas as pd
df = pd.DataFrame(dict(a=a,b=b))
result = list(df.b[~df.a.duplicated()])
# result will be ['a', 'c', 'd', 'e', 'h']
Is there a more efficient way of doing this for large a and b?
You could check whether this is faster:
firsts = {}
result = [firsts.setdefault(x, y) for x, y in zip(a, b) if x not in firsts]
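The dictionary lookup is a constant-time membership test on average, whereas `a[index] not in a[:index]` rescans a slice for every element. The same idea can be sketched with an explicit set, assuming the elements of a are hashable:

```python
a = [1, 1, 2, 4, 5, 5, 5, 6, 1]
b = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']

seen = set()      # values of a encountered so far
result = []
for x, y in zip(a, b):
    if x not in seen:
        seen.add(x)
        result.append(y)
print(result)
```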

Get right label using indices?

Really stupid question, as I am new to Python:
If I have labels = ['a', 'b', 'c', 'd'],
and indices = [2, 3, 0, 1]
How should I get the corresponding label using each index so I can get: ['c', 'd', 'a', 'b']?
There are a few alternatives. One is to use a list comprehension:
labels = ['a', 'b', 'c', 'd']
indices = [2, 3, 0, 1]
result = [labels[i] for i in indices]
print(result)
Output
['c', 'd', 'a', 'b']
Basically iterate over each index and fetch the item at that position. The above is equivalent to the following for loop:
result = []
for i in indices:
    result.append(labels[i])
A third option is to use operator.itemgetter:
from operator import itemgetter
labels = ['a', 'b', 'c', 'd']
indices = [2, 3, 0, 1]
result = list(itemgetter(*indices)(labels))
print(result)
Output
['c', 'd', 'a', 'b']
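One caveat with itemgetter: given exactly one index it returns the bare item rather than a tuple, and given no indices it raises TypeError. A small sketch of a wrapper that handles both cases; pick is a hypothetical helper name, and the longer labels are chosen only to make the single-index case visible:

```python
from operator import itemgetter

labels = ['ant', 'bee', 'cat', 'dog']

def pick(labels, indices):
    """Hypothetical helper: return the labels at the given indices as a list."""
    if not indices:
        return []                      # itemgetter() with no arguments raises TypeError
    picked = itemgetter(*indices)(labels)
    # With exactly one index, itemgetter returns the item itself, not a tuple
    return [picked] if len(indices) == 1 else list(picked)

print(pick(labels, [2, 3, 0, 1]))
print(pick(labels, [2]))
print(pick(labels, []))
```

The list comprehension needs no such special-casing, which is one reason to prefer it when the number of indices varies.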

Making dynamically sizing if statements

I want to check the columns of a dataframe inside a for loop by using a list, then perform some operations that change the contents of that list for the next iteration. Is it possible to dynamically size the if statement described here?
Example:
df =
a|b|c|d|e
1|2|3|4|5
6|7|8|9|0
check_list = ['a']
for i in range(10):
    if check_list in df.columns:
        do x
        # variable check_list is now equal to ['a','b']
So in the first iteration the list contains only 'a', in the second iteration it contains 'a' and 'b', and in later iterations it changes further. I hope this adequately explains my question.
Working code that might help answer your question:
import pandas as pd
def add_column(l, all_columns):
    """Example of a function for adding new columns to the check list."""
    if len(l) < len(all_columns):
        return l + [all_columns[len(l)]]
    else:
        return l
all_columns = 'abcde'
df = pd.DataFrame([[1, 2, 3, 4, 5], [6, 7, 8, 9, 0]], columns=list(all_columns))
print(df)
check_list = ['a']
for i in range(10):
    # issuperset() seems to be the best way to check that the list includes only column names from the dataframe
    if set(df.columns).issuperset(set(check_list)):
        check_list = add_column(check_list, all_columns)
        print("checking list:", check_list)
        # do other stuff
Output:
   a  b  c  d  e
0  1  2  3  4  5
1  6  7  8  9  0
checking list: ['a', 'b']
checking list: ['a', 'b', 'c']
checking list: ['a', 'b', 'c', 'd']
checking list: ['a', 'b', 'c', 'd', 'e']
checking list: ['a', 'b', 'c', 'd', 'e']
checking list: ['a', 'b', 'c', 'd', 'e']
checking list: ['a', 'b', 'c', 'd', 'e']
checking list: ['a', 'b', 'c', 'd', 'e']
checking list: ['a', 'b', 'c', 'd', 'e']
checking list: ['a', 'b', 'c', 'd', 'e']
I hope this helps.
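An equivalent way to write a condition whose size follows the list is all() over a generator. The sketch below uses a plain list of names as a stand-in for df.columns so that it runs without pandas; with a real DataFrame, `col in df.columns` should behave the same way. The history list is only there to record each iteration.

```python
columns = list('abcde')    # stand-in for df.columns
check_list = ['a']
history = []

for i in range(3):
    # True only if every name currently in check_list is a column
    if all(col in columns for col in check_list):
        # grow the list by one column, if any are left
        if len(check_list) < len(columns):
            check_list = check_list + [columns[len(check_list)]]
    history.append(list(check_list))

print(history)
```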

Repeat each element based on a list of values

Is there a Python builtin that repeats each element of a list based on the corresponding value in another list? For example A in list x position 0 is repeated 2 times because of the value 2 at position 0 in the list y.
>>> x = ['A', 'B', 'C']
>>> y = [2, 1, 3]
>>> f(x, y)
['A', 'A', 'B', 'C', 'C', 'C']
Or to put it another way, what is the fastest way to achieve this operation?
Just use a simple list comprehension:
>>> x = ['A', 'B', 'C']
>>> y = [2, 1, 3]
>>> [x[i] for i in range(len(x)) for j in range(y[i])]
['A', 'A', 'B', 'C', 'C', 'C']
>>>
One way would be the following:
x = ['A', 'B', 'C']
y = [2, 1, 3]
s = []
for a, b in zip(x, y):
    s.extend([a] * b)
print(s)
Result:
['A', 'A', 'B', 'C', 'C', 'C']
from itertools import chain
list(chain(*[[a] * b for a, b in zip(x, y)]))
['A', 'A', 'B', 'C', 'C', 'C']
There is itertools.repeat as well, but that ends up being uglier for this particular case.
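For comparison, here is a sketch of the itertools.repeat variant, combined with chain.from_iterable so that no intermediate lists are built:

```python
from itertools import chain, repeat

x = ['A', 'B', 'C']
y = [2, 1, 3]

# repeat(a, b) yields a exactly b times; chain.from_iterable flattens the pieces
result = list(chain.from_iterable(repeat(a, b) for a, b in zip(x, y)))
print(result)
```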
Try this:
x = ['A', 'B', 'C']
y = [2, 1, 3]
newarray = []
for i in range(len(x)):
    # wrap the element in a list so it is repeated as a whole
    newarray.extend([x[i]] * y[i])
print(newarray)
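Note that multiplying an element directly and multiplying a one-element list behave differently once the elements are strings longer than one character. A short sketch, with sample values chosen only for illustration:

```python
x = ['AB', 'C']
y = [2, 1]

by_string = []
by_list = []
for item, count in zip(x, y):
    by_string.extend(item * count)    # 'AB' * 2 == 'ABAB', extended character by character
    by_list.extend([item] * count)    # ['AB'] * 2 == ['AB', 'AB']

print(by_string)
print(by_list)
```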

How do I keep the index of the duplicate element unchanged

Here is an input list:
['a', 'b', 'b', 'c', 'c', 'd']
The output I expect should be:
[[0, 'a'], [1, 'b'], [1, 'b'], [2, 'c'], [2, 'c'], [3, 'd']]
I tried to use map():
>>> list(map(lambda pair: [pair[0], pair[1]], enumerate(['a', 'b', 'b', 'c', 'c', 'd'])))
[[0, 'a'], [1, 'b'], [2, 'b'], [3, 'c'], [4, 'c'], [5, 'd']]
How can I get the expected result?
EDIT: This is not a sorted list; the index increases only when a new element is encountered.
>>> import itertools
>>> seq = ['a', 'b', 'b', 'c', 'c', 'd']
>>> [[i, c] for i, (k, g) in enumerate(itertools.groupby(seq)) for c in g]
[[0, 'a'], [1, 'b'], [1, 'b'], [2, 'c'], [2, 'c'], [3, 'd']]
[
    [i, x]
    for i, (value, group) in enumerate(itertools.groupby(['a', 'b', 'b', 'c', 'c', 'd']))
    for x in group
]
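Keep in mind that itertools.groupby only groups consecutive equal items, so a value that reappears later starts a new group and gets a new index. A small sketch with a made-up input:

```python
import itertools

seq = ['a', 'b', 'b', 'a']    # 'a' appears in two separate runs
result = [[i, c] for i, (k, g) in enumerate(itertools.groupby(seq)) for c in g]
print(result)
```

Here the second 'a' is numbered 2 rather than 0, which matters if repeated values should share an index.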
It sounds like you want to rank the terms based on a lexicographical ordering.
input = ['a', 'b', 'b', 'c', 'c', 'd']
mapping = { v:i for (i, v) in enumerate(sorted(set(input))) }
[ [mapping[v], v] for v in input ]
Note that this works for unsorted inputs as well.
If, as your amendment suggests, you want to number items based on order of first appearance, a different approach is in order. The following is short and sweet, albeit offensively hacky:
[ [d.setdefault(v, len(d)), v] for d in [{}] for v in input ]
When the list is sorted, use groupby (see jamylak's answer); when it is not, just iterate over the list and check whether you've already seen each item:
a = ['a', 'b', 'b', 'c', 'c', 'd']
result = []
d = {}
n = 0
for k in a:
    if k not in d:
        d[k] = n
        n += 1
    result.append([d[k], k])
This is an efficient solution: it makes a single pass and takes only O(n) time.
Example output for an unsorted list, ['a', 'b', 'b', 'c', 'c', 'd', 'a']:
[[0, 'a'], [1, 'b'], [1, 'b'], [2, 'c'], [2, 'c'], [3, 'd'], [0, 'a']]
As you can see, the items keep the same order as in the input list.
When you sort the list first you need O(n*log(n)) additional time.
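The single-pass approach can be wrapped in a function. This is a sketch only; the function name is hypothetical, and len(index_of) plays the role of a running counter:

```python
def number_by_first_appearance(items):
    """Pair each item with the index of its first appearance (hypothetical helper)."""
    index_of = {}
    result = []
    for item in items:
        if item not in index_of:
            # the next free index equals the number of distinct items seen so far
            index_of[item] = len(index_of)
        result.append([index_of[item], item])
    return result

print(number_by_first_appearance(['a', 'b', 'b', 'c', 'c', 'd', 'a']))
```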
