Reassigning numpy.array() - python

In the code below, I can easily reduce the array ['a','b','a','c','b','b','c','a'] to a binary array [0 1 0 1 1 1 1 0] so that 'a' -> 0 and 'b','c' -> 1. How do I transform it to a ternary array so that 'a' -> 0, 'b' -> 1, 'c' -> 2, without using for and if-else? Thanks.
import numpy as np
x = np.array(['a', 'b', 'a', 'c', 'b', 'b', 'c', 'a'])
y = np.where(x=='a', 0, 1)
print(y)

By doing:
np.where(x == 'a', 0, (np.where(x == 'b', 1, 2)))
note that this changes all the characters that are neither 'a' or 'b' to 2. I've assumed that you have only an array with a,b and c.

A more scalable version is using dictionary of conversion:
my_dict = {'a':0, 'b':1, 'c':2}
x = np.vectorize(my_dict.get)(x)
output:
[0 1 0 2 1 1 2 0]
Another approach is:
np.select([x==i for i in ['a','b','c']], np.arange(3))
For small dictionary #ypno's answer is going to be faster. For larger dictionary, use this answer.
Time Comparison:
Ternary alphabet:
lst = ['a','b','c']
my_dict = {k: v for v, k in enumerate(lst)}
##Ehsan's solution1
def m1(x):
return np.vectorize(my_dict.get)(x)
##ypno's solution
def m2(x):
return np.where(x == 'a', 0, (np.where(x == 'b', 1, 2)))
##SteBog's solution
def m3(x):
y = np.where(x=='a', 0, x)
y = np.where(x=='b', 1, y)
y = np.where(x=='c', 2, y)
return y.astype(np.integer)
##Ehsan's solution 2 (also suggested by user3483203 in comments)
def m4(x):
return np.select([x==i for i in lst], np.arange(len(lst)))
##juanpa.arrivillaga's solution suggested in comments
def m5(x):
return np.array([my_dict[i] for i in x.tolist()])
in_ = [np.random.choice(lst, size = n) for n in [10,100,1000,10000,100000]]
Same analysis for 8 letter alphabet:
lst = ['a','b','c','d','e','f','g','h']

Related

Pandas Multi-index set value based on three different condition

The objective is to create a new multiindex column based on 3 conditions of the column (B)
Condition for B
if B<0
CONDITION_B='l`
elif B<-1
CONDITION_B='L`
else
CONDITION_B='g`
Naively, I thought, we can simply create two different mask and replace the value as suggested
# Handle CONDITION_B='l` and CONDITION_B='g`
mask_2 = df.loc[:,idx[:,'B']]<0
appenddf_2=mask_2.replace({True:'g',False:'l'}).rename(columns={'A':'iv'},level=1)
and then
# CONDITION_B='L`
mask_33 = df.loc[:,idx[:,'B']]<-0.1
appenddf_2=mask_33.replace({True:'G'}).rename(columns={'A':'iv'},level=1)
As expected, this will throw an error
TypeError: sequence item 1: expected str instance, bool found
May I know how to handle the 3 different condition
Expected output
ONE TWO
B B
g L
l l
l g
g l
L L
The code to produce the error is
import pandas as pd
import numpy as np
np.random.seed(3)
arrays = [np.hstack([['One']*2, ['Two']*2]) , ['A', 'B', 'A', 'B']]
columns = pd.MultiIndex.from_arrays(arrays)
df= pd.DataFrame(np.random.randn(5, 4), columns=list('ABAB'))
df.columns = columns
idx = pd.IndexSlice
mask_2 = df.loc[:,idx[:,'B']]<0
appenddf_2=mask_2.replace({True:'g',False:'l'}).rename(columns={'A':'iv'},level=1)
mask_33 = df.loc[:,idx[:,'B']]<-0.1
appenddf_2=mask_33.replace({True:'G'}).rename(columns={'A':'iv'},level=1)
IIUC:
np.select() is ideal in this case:
conditions=[
df.loc[:,idx[:,'B']].lt(0) & df.loc[:,idx[:,'B']].gt(-1),
df.loc[:,idx[:,'B']].lt(-1),
df.loc[:,idx[:,'B']].ge(0)
]
labels=['l','L','g']
out=pd.DataFrame(np.select(conditions,labels),columns=df.loc[:,idx[:,'B']].columns)
OR
via np.where():
s=np.where(df.loc[:,idx[:,'B']].lt(0) & df.loc[:,idx[:,'B']].gt(-1),'l',np.where(df.loc[:,idx[:,'B']].lt(-1),'L','g'))
out=pd.DataFrame(s,columns=df.loc[:,idx[:,'B']].columns)
output of out:
One Two
B B
0 g L
1 l l
2 l g
3 g l
4 L L
I don't fully understand what you want to do but try something like this:
df = pd.DataFrame({'B': [ 0, -1, -2, -2, -1, 0, 0, -1, -1, -2]})
df['ONE'] = np.where(df['B'] < 0, 'l', 'g')
df['TWO'] = np.where(df['B'] < -1, 'L', df['ONE'])
df = df.set_index(['ONE', 'TWO'])
Output result:
>>> df
B
ONE TWO
g g 0
l l -1
L -2
L -2
l -1
g g 0
g 0
l l -1
l -1
L -2

Create an adjacency matrix using a dictionary with letter values converted to numbers in python

So I have a dictionary with letter values and keys and I want to generate an adjacency matrix using digits (0 or 1). But I don't know how to do that.
Here is my dictionary:
g = { "a" : ["c","e","b"],
"b" : ["f","a"]}
And I want an output like this :
import numpy as np
new_dic = {'a':[0,1,1,0,1,0],'b':(1,0,0,0,0,1)}
rows_names = ['a','b'] # I use a list because dictionaries don't memorize the positions
adj_matrix = np.array([new_dic[i] for i in rows_names])
print(adj_matrix)
Output :
[[0 1 1 0 1 0]
[1 0 0 0 0 1]]
So it's an adjacency matrix: column/row 1 represent A, column/row 2 represent B ...
Thank you !
I don't know if it helps but here is how I convert all letters to numbers using ascii :
for key, value in g.items():
nums = [str(ord(x) - 96) for x in value if x.lower() >= 'a' and x.lower() <= 'z']
g[key] = nums
print(g)
Output :
{'a': ['3', '5', '2'], 'b': ['6', '1']}
So a == 1 b == 2 ...
So my problem is: If a take the keys a with the first value "e", how should I do so that the e is found in the column 5 line 1 and not in the column 2 line 1 ? and replacing the e to 1
Using comprehensions:
g = {'a': ['c', 'e', 'b'], 'b': ['f', 'a']}
vals = 'a b c d e f'.split() # Column values
new_dic = {k: [1 if x in v else 0 for x in vals] for k, v in g.items()}

Implementing k nearest neighbours from distance matrix?

I am trying to do the following:
Given a dataFrame of distance, I want to identify the k-nearest neighbours for each element.
Example:
A B C D
A 0 1 3 2
B 5 0 2 2
C 3 2 0 1
D 2 3 4 0
If k=2, it should return:
A: B D
B: C D
C: D B
D: A B
Distances are not necessarily symmetric.
I am thinking there must be something somewhere that does this in an efficient way using Pandas DataFrames. But I cannot find anything?
Homemade code is also very welcome! :)
Thank you!
The way I see it, I simply find n + 1 smallest numbers/distances/neighbours for each row and remove the 0, which would then give you n numbers/distances/neighbours. Keep in mind that the code will not work if you have a distance of zeroes! Only the diagonals are allowed to be 0.
import pandas as pd
import numpy as np
X = pd.DataFrame([[0, 1, 3, 2],[5, 0, 2, 2],[3, 2, 0, 1],[2, 3, 4, 0]])
X.columns = ['A', 'B', 'C', 'D']
X.index = ['A', 'B', 'C', 'D']
X = X.T
for i in X.index:
Y = X.nsmallest(3, i)
Y = Y.T
Y = Y[Y.index.str.startswith(i)]
Y = Y.loc[:, Y.any()]
for j in Y.index:
print(i + ": ", list(Y.columns))
This prints out:
A: ['B', 'D']
B: ['C', 'D']
C: ['D', 'B']
D: ['A', 'B']

Creating a subset of array from another array : Python

I have a basic question regarding working with arrays:
a= ([ c b a a b b c a a b b c a a b a c b])
b= ([ 0 1 0 1 0 0 0 0 2 0 1 0 2 0 0 1 0 1])
I) Is there a short way, to count the number of time 'c' in a corresponds to 0, 1, and 2 in b and 'b' in a corresponds to 0, 1, 2 and so on
II) How do I create a new array c (subset of a) and d(subset of b) such that it only contains those elements if the corresponding element in a is 'c' ?
In [10]: p = ['a', 'b', 'c', 'a', 'c', 'a']
In [11]: q = [1, 2, 1, 3, 3, 1]
In [12]: z = zip(p, q)
In [13]: z
Out[13]: [('a', 1), ('b', 2), ('c', 1), ('a', 3), ('c', 3), ('a', 1)]
In [14]: counts = {}
In [15]: for pair in z:
...: if pair in counts.keys():
...: counts[pair] += 1
...: else:
...: counts[pair] = 1
...:
In [16]: counts
Out[16]: {('a', 1): 2, ('a', 3): 1, ('b', 2): 1, ('c', 1): 1, ('c', 3): 1}
In [17]: sub_p = []
In [18]: sub_q = []
In [19]: for i, element in enumerate(p):
...: if element == 'a':
...: sub_p.append(element)
...: sub_q.append(q[i])
In [20]: sub_p
Out[20]: ['a', 'a', 'a']
In [21]: sub_q
Out[21]: [1, 3, 1]
Explanation
zip takes two lists and runs a figurative zipper between them. Resulting in a list of tuples
I've used a simplistic approach, I'm just maintaining a map/dictionary that makes not of how many times it has seen a pair of char-int tuples
Then I make 2 sub lists that you can modify to use the character in question and figure out what it maps to
Alternative methods
As abarnert suggested you could use A Counter from collections instead.
Or you could just a count method on z . eg: z.count('a',1). Or you can use a defaultdict instead.
The questions are a bit vague but here's a quick method (some would call it dirty) using Pandas though I think something written without recourse to Pandas should be preferred.
import pandas as pd
#create OP's lists
a= [ 'c', 'b', 'a', 'a', 'b', 'b', 'c', 'a', 'a', 'b', 'b', 'c', 'a', 'a', 'b', 'a', 'c', 'b']
b= [ 0, 1, 0, 1, 0, 0, 0, 0, 2, 0, 1, 0, 2, 0, 0, 1, 0, 1]
#dump lists to a Pandas DataFrame
df = pd.DataFrame({'a':a, 'b':b})
Question 1
provided I interpreted it correctly, you can cross-tabulate the two arrays:
pd.crosstab(df.a, df.b).stack(). Cross-tabulate basically counts the number of times each number corresponds to a particular letter. .stack is a command to turn output from .crosstab into a more legible format.
#question 1
pd.crosstab(df.a, df.b).stack()
## -- End pasted text --
Out[9]:
a b
a 0 3
1 2
2 2
b 0 4
1 3
2 0
c 0 4
1 0
2 0
dtype: int64
Question 2
Here, I use Pandas boolean indexing ability to only select the elements in array a that correspond to value 'c'. So df.a=='c' will return True for every value in a that is 'c' and False otherwise. df.loc[df.a=='c','a'] will return values from a for which the boolean statement was true.
c = df.loc[df.a == 'c', 'a']
d = df.loc[df.a == 'c', 'b']
In [15]: c
Out[15]:
0 c
6 c
11 c
16 c
Name: a, dtype: object
In [16]: d
Out[16]:
0 0
6 0
11 0
16 0
Name: b, dtype: int64
Python List : https://www.tutorialspoint.com/python/python_lists.htm has a count method.
I suggest you to first zip both lists, as said in comments, and then count occurances of tuple c, 1 and occurances of tuple c, 0 and sum them up, thats what you need for (I), basically.
For (II), if I understood you correctly, you have to take the zipped lists and apply filter on them with lambda x: x[0]==x[1]

python compare items in 2 list of different length - order is important

list_1 = ['a', 'a', 'a', 'b']
list_2 = ['a', 'b', 'b', 'b', 'c']
so in the list above, only items in index 0 is the same while index 1 to 4 in both list are different. also, list_2 has an extra item 'c'.
I want to count the number of times the index in both list are different, In this case I should get 3.
I tried doing this:
x = 0
for i in max(len(list_1),len(list_2)):
if list_1[i]==list_2[i]:
continue
else:
x+=1
I am getting an error.
Use the zip() function to pair up the lists, counting all the differences, then add the difference in length.
zip() will only iterate over the items that can be paired up, but there is little point in iterating over the remainder; you know those are all to be counted as different:
differences = sum(a != b for a, b in zip(list_1, list_2))
differences += abs(len(list_1) - len(list_2))
The sum() sums up True and False values; this works because Python's boolean type is a subclass of int and False equals 0, True equals 1. Thus, for each differing pair of elements, the True values produced by the != tests add up as 1s.
Demo:
>>> list_1 = ['a', 'a', 'a', 'b']
>>> list_2 = ['a', 'b', 'b', 'b', 'c']
>>> sum(a != b for a, b in zip(list_1, list_2))
2
>>> abs(len(list_1) - len(list_2))
1
>>> difference = sum(a != b for a, b in zip(list_1, list_2))
>>> difference += abs(len(list_1) - len(list_2))
>>> difference
3
You can try with this :
list1 = [1,2,3,5,7,8,23,24,25,32]
list2 = [5,3,4,21,201,51,4,5,9,12,32,23]
list3 = []
for i in range(len(list2)):
if list2[i] not in list1:
pass
else :
list3.append(list2[i])
print list3
print len(list3)
As ZdaR commented, you should get 3 as the result and zip_longest can help here if you don't have Nones in the lists.
from itertools import zip_longest
list_1=['a', 'a', 'a', 'b']
list_2=['a', 'b', 'b', 'b', 'c']
x = sum(a != b for a,b in zip_longest(list_1,list_2))
Can i try this way using for loop:
>>> count = 0
>>> ls1 = ['a', 'a', 'a', 'b']
>>> ls2 = ['a', 'b', 'b', 'b', 'c']
>>> for i in range(0, max(len(ls1),len(ls2)), 1):
... if ls1[i:i+1] != ls2[i:i+1]:
... count += 1
...
>>> print count
3
>>>
Or try this (didn't change the lists):
dif = 0
for i in range(len(min(list_1, list_2))):
if list_1[i]!=list_2[i]:
dif+=1
#print(list_1[i], " != ", list_2[i], " --> Dif = ", dif)
dif+=(len(max(list_1, list_2)) - len(min(list_1, list_2)))
print("Difference = ", dif)
(Output: Difference = 3)
Not much better, but here's another option
if len(a) < len(b):
b = b[0:len(a)]
else:
a = a[0:len(b)]
correct = sum(a == b)

Categories

Resources