Trimming numpy arrays: what is the best method? - python

Consider the following code:
a = np.arange (1,6)
b = np.array(["A", "B", "C", "D", "E"])
c = np.arange (21, 26)
a,b,c = a[a> 3],b[a>3], c[a >3]
print a,b,c
The output is: [4 5] ['D' 'E'] [24 25]
I can't figure out why this output is different from the following:
a = np.arange (1,6)
b = np.array(["A", "B", "C", "D", "E"])
c = np.arange (21, 26)
a = a[a>3]
b = b[a>3]
c = c[a>3]
print a,b,c
output:
[4 5] ['A' 'B'] [21 22]
Any idea?

In the first part, when you do:
a, b, c = a[a> 3], b[a>3], c[a >3]
the whole right-hand side is evaluated against the original a = np.arange (1,6). The value of a is only modified after all three expressions have been evaluated.
In the second part, however, you are filtering b and c with an already filtered and modified array a, because it happens after you have done:
a = a[a>3]
Therefore, the following lines are filtered against array a now equal to [4, 5]
b = b[a>3] # <-- over a = [4, 5] gives values at index 0 and 1
c = c[a>3] # <-- over a = [4, 5] gives values at index 0 and 1
In the second case, you could use a temporary array to hold the filtered values of a.
temp = a[a>3]
b = b[a>3]
c = c[a>3]
a = temp
or, as suggested in the comments by @hpaulj, evaluate and store the mask in a variable first, then use it as many times as needed without having to redo the work:
mask = a > 3
a = a[mask]
b = b[mask]
c = c[mask]
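The mask approach can be sketched as follows. The names `a_trimmed`, `b_trimmed`, `c_trimmed` and `stale_mask` are only illustrative. The point is that the mask is built once, while `a` still has all five elements, and then reused for every array.

```python
import numpy as np

a = np.arange(1, 6)
b = np.array(["A", "B", "C", "D", "E"])
c = np.arange(21, 26)

# Compute the mask once, while `a` still has its original five elements.
mask = a > 3

# Every array is filtered with the same five-element mask.
a_trimmed, b_trimmed, c_trimmed = a[mask], b[mask], c[mask]
print(a_trimmed, b_trimmed, c_trimmed)  # [4 5] ['D' 'E'] [24 25]

# Rebuilding the mask from the already-trimmed array gives a two-element
# mask, which no longer lines up with the five-element b and c.
stale_mask = a_trimmed > 3
print(stale_mask)  # [ True  True]
```

As far as I know, recent NumPy versions refuse to apply a boolean mask whose length does not match the array, so the sequential version from the question may raise an IndexError there instead of silently returning the first two elements.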

A simple fix is to trim your "a" array last, not first!
b=b[a>3]
c=c[a>3]
a=a[a>3]
If you plan to perform multiple trimmings, consider saving the mask a>3 to a variable first (as suggested in the other answer), which avoids recomputing it each time.

Related

In python numpy, how to replace some rows in array A with array B if we know the index

In python numpy, how to replace some rows in array A with array B if we know the index.
For example
we have
a = np.array([[1,2],[3,4],[5,6]])
b = np.array([[10,10],[1000, 1000]])
index = [0,2]
I want to change a to
a = np.array([[10,10],[3,4],[1000,1000]])
I have considered the function np.where, but it needs a boolean condition, which is not very convenient.
I would do it following way
import numpy as np
a = np.array([[1,2],[3,4],[5,6]])
b = np.array([[10,10],[1000, 1000]])
index = [0,2]
a[index] = b
print(a)
gives output
[[  10   10]
 [   3    4]
 [1000 1000]]
You can use :
a[index] = b
For example :
import numpy as np
a = np.array([[1,2],[3,4],[5,6]])
b = np.array([[10,10],[1000, 1000]])
index = [0,2]
a[index] = b
print(a)
Result :
[[  10   10]
 [   3    4]
 [1000 1000]]
In Python's NumPy library, you can replace some rows in array A with array B by assigning to the row indices directly. Note that numpy.put() is not suitable here: it treats the indices as positions in the flattened array, so it overwrites individual elements rather than whole rows. Here's an example:
import numpy as np
# Initialize array A
A = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
# Initialize array B
B = np.array([[10, 20, 30], [40, 50, 60]])
# Indices of the rows to be replaced in array A
indices = [0, 1]
# Replace rows in array A with rows in array B
A[indices] = B
print(A)
In this example, the first two rows in array A are replaced with the two rows of array B, so the output will be
[[10 20 30]
 [40 50 60]
 [ 7  8  9]]
Simply a[indices] = b. Note that np.put(a, indices, b) is not an equivalent alternative: it indexes the flattened array, so it replaces individual elements rather than rows.
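A quick way to compare the two calls is to run both on copies of the array from the question. The names `by_rows` and `by_put` are only illustrative. Since np.put addresses the flattened array, its result is worth checking before relying on it for row replacement.

```python
import numpy as np

a = np.array([[1, 2], [3, 4], [5, 6]])
b = np.array([[10, 10], [1000, 1000]])
index = [0, 2]

# Index assignment: rows 0 and 2 of the copy are replaced by the rows of b.
by_rows = a.copy()
by_rows[index] = b
print(by_rows.tolist())  # [[10, 10], [3, 4], [1000, 1000]]

# np.put: positions 0 and 2 of the *flattened* copy receive the first two
# values of the flattened b (10 and 10); no whole row is replaced.
by_put = a.copy()
np.put(by_put, index, b)
print(by_put.tolist())  # [[10, 2], [10, 4], [5, 6]]
```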

Pandas Multi-index set value based on three different condition

The objective is to create a new multiindex column based on 3 conditions of the column (B)
Condition for B
if B < -1
CONDITION_B = 'L'
elif B < 0
CONDITION_B = 'l'
else
CONDITION_B = 'g'
Naively, I thought we could simply create two different masks and replace the values as follows
# Handle CONDITION_B='l` and CONDITION_B='g`
mask_2 = df.loc[:,idx[:,'B']]<0
appenddf_2=mask_2.replace({True:'g',False:'l'}).rename(columns={'A':'iv'},level=1)
and then
# CONDITION_B='L`
mask_33 = df.loc[:,idx[:,'B']]<-0.1
appenddf_2=mask_33.replace({True:'G'}).rename(columns={'A':'iv'},level=1)
As expected, this will throw an error
TypeError: sequence item 1: expected str instance, bool found
May I know how to handle the 3 different conditions?
Expected output
ONE TWO
B B
g L
l l
l g
g l
L L
The code to produce the error is
import pandas as pd
import numpy as np
np.random.seed(3)
arrays = [np.hstack([['One']*2, ['Two']*2]) , ['A', 'B', 'A', 'B']]
columns = pd.MultiIndex.from_arrays(arrays)
df= pd.DataFrame(np.random.randn(5, 4), columns=list('ABAB'))
df.columns = columns
idx = pd.IndexSlice
mask_2 = df.loc[:,idx[:,'B']]<0
appenddf_2=mask_2.replace({True:'g',False:'l'}).rename(columns={'A':'iv'},level=1)
mask_33 = df.loc[:,idx[:,'B']]<-0.1
appenddf_2=mask_33.replace({True:'G'}).rename(columns={'A':'iv'},level=1)
IIUC:
np.select() is ideal in this case:
conditions=[
df.loc[:,idx[:,'B']].lt(0) & df.loc[:,idx[:,'B']].gt(-1),
df.loc[:,idx[:,'B']].lt(-1),
df.loc[:,idx[:,'B']].ge(0)
]
labels=['l','L','g']
out=pd.DataFrame(np.select(conditions,labels),columns=df.loc[:,idx[:,'B']].columns)
OR
via np.where():
s=np.where(df.loc[:,idx[:,'B']].lt(0) & df.loc[:,idx[:,'B']].gt(-1),'l',np.where(df.loc[:,idx[:,'B']].lt(-1),'L','g'))
out=pd.DataFrame(s,columns=df.loc[:,idx[:,'B']].columns)
output of out:
One Two
B B
0 g L
1 l l
2 l g
3 g l
4 L L
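A compact variant of the np.select idea, as a sketch on a small hand-written frame rather than the random data from the question. The sample values and the names `b_cols`, `values` and `labels` are assumptions made for illustration. Because np.select uses the first condition that is True, testing the stricter threshold first removes the need for the combined `lt(0) & gt(-1)` condition, and `default` covers everything else.

```python
import numpy as np
import pandas as pd

columns = pd.MultiIndex.from_product([["One", "Two"], ["A", "B"]])
df = pd.DataFrame(
    [[0.1, 0.5, 0.2, -1.5],
     [0.3, -0.4, 0.1, -0.6],
     [0.0, -2.0, 0.0, 0.9]],
    columns=columns,
)

idx = pd.IndexSlice
b_cols = df.loc[:, idx[:, "B"]]   # only the B column of each top-level group
values = b_cols.to_numpy()

# First matching condition wins: B < -1 -> 'L', otherwise B < 0 -> 'l',
# otherwise the default 'g'.
labels = np.select([values < -1, values < 0], ["L", "l"], default="g")
out = pd.DataFrame(labels, index=b_cols.index, columns=b_cols.columns)
print(out)
```

One difference worth noting: with this ordering a value exactly equal to -1 is labelled 'l', whereas the conditions in the answer above match none of the three cases for that value.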
I don't fully understand what you want to do but try something like this:
df = pd.DataFrame({'B': [ 0, -1, -2, -2, -1, 0, 0, -1, -1, -2]})
df['ONE'] = np.where(df['B'] < 0, 'l', 'g')
df['TWO'] = np.where(df['B'] < -1, 'L', df['ONE'])
df = df.set_index(['ONE', 'TWO'])
Output result:
>>> df
         B
ONE TWO
g   g    0
l   l   -1
    L   -2
    L   -2
    l   -1
g   g    0
    g    0
l   l   -1
    l   -1
    L   -2

Construct an assignment matrix - Python

I have two lists of elements
a = [1,2,3,2,3,1,1,1,1,1]
b = [3,1,2,1,2,3,3,3,3,3]
and I am trying to uniquely match the element from a to b, my expected result is like this:
1: 3
2: 1
3: 2
So I tried to construct an assignment matrix and then use scipy.optimize.linear_sum_assignment
a = [1,2,3,2,3,1,1,1,1,1]
b = [3,1,2,1,2,3,3,3,3,3]
total_true = np.unique(a)
total_pred = np.unique(b)
matrix = np.zeros(shape=(len(total_pred), len(total_true)))
for n, i in enumerate(total_true):
    for m, j in enumerate(total_pred):
        matrix[n, m] = sum(1 for item in b if item==(i))
I expected the matrix to be:
   1  2  3
1  0  2  0
2  0  0  2
3  6  0  0
But the output is:
[[2. 2. 2.]
[2. 2. 2.]
[6. 6. 6.]]
What mistake did I make here? Thank you very much
You don't even need Pandas for this. Try zip and dict:
In [42]: a = [1,2,3,2,3,1,1,1,1,1]
...: b = [3,1,2,1,2,3,3,3,3,3]
...:
In [43]: c =zip(a,b)
In [44]: dict(c)
Out[44]: {1: 3, 2: 1, 3: 2}
UPDATE: as OP said, if we need to store all the values with the same key, we can use defaultdict:
In [58]: from collections import defaultdict
In [59]: d = defaultdict(list)
In [60]: for k, v in zip(a, b):  # re-zip: in Python 3, c was already consumed by dict(c)
    ...:     d[k].append(v)
...:
In [61]: d
Out[61]: defaultdict(list, {1: [3, 3, 3, 3, 3, 3], 2: [1, 1], 3: [2, 2]})
This line:
matrix[n, m] = sum(1 for item in b if item==(i))
counts the occurrences of i in b and saves the result to matrix[n, m]. Each cell of the matrix will contain either the number of 1's in b (i.e. 2) or the number of 2's in b (i.e. 2) or the number of 3's in b (i.e. 6). Notice that this value is completely independent of j, which means that the values in one row will always be the same.
In order to take j into consideration, try replacing that line with:
matrix[n, m] = sum(1 for x, y in zip(a, b) if (x, y) == (j, i))
Regarding your expected output: the matrix is indexed as a(i, j), where i is the row index and j is the column index. Looking at a(3, 1) in your matrix, the value is 6, which means the combination (3, 1) occurs 6 times, with 3 coming from b and 1 coming from a. We can find all the pairs formed by the two lists:
matches = [tuple([x, y]) for x,y in zip(b, a)]
Then we can find how many matches there are of a specific combination, for example a(3, 1).
result = matches.count((3,1))
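Putting the pair-counting idea together, here is a minimal sketch that builds the whole expected matrix in one pass. It uses collections.Counter instead of repeated count() calls, which is a substitution on my part, and the names `pair_counts`, `rows`, `cols` and `mapping` are illustrative only.

```python
from collections import Counter

a = [1, 2, 3, 2, 3, 1, 1, 1, 1, 1]
b = [3, 1, 2, 1, 2, 3, 3, 3, 3, 3]

# Count how often each (b value, a value) pair occurs.
pair_counts = Counter(zip(b, a))

rows = sorted(set(b))  # row labels come from b
cols = sorted(set(a))  # column labels come from a

# matrix[r][c] = number of positions where b == rows[r] and a == cols[c]
matrix = [[pair_counts[(r, c)] for c in cols] for r in rows]
print(matrix)  # [[0, 2, 0], [0, 0, 2], [6, 0, 0]]

# For each value of a, pick the value of b it is paired with most often.
mapping = {c: max(rows, key=lambda r: pair_counts[(r, c)]) for c in cols}
print(mapping)  # {1: 3, 2: 1, 3: 2}
```

Picking the most frequent partner per column is enough for this data. If two values of a competed for the same value of b, a proper assignment solver would be the safer choice.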

Python Pandas declaration empty DataFrame only with columns

I need to declare empty dataframe in python in order to append it later in a loop. Below the line of the declaration:
result_table = pd.DataFrame([[], [], [], [], []], columns = ["A", "B", "C", "D", "E"])
It throws an error:
AssertionError: 5 columns passed, passed data had 0 columns
Why is it so? I tried to find out the solution, but I failed.
import pandas as pd
df = pd.DataFrame(columns=['A','B','C','D','E'])
That's it!
Because you actually pass no data. Try this
result_frame = pd.DataFrame(columns=['a', 'b', 'c', 'd', 'e'])
if you then want to add data, use
result_frame.loc[len(result_frame)] = [1, 2, 3, 4, 5]
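To see why the original call fails, it may help to look at the shape of the data that was passed: a list of five empty lists is five rows with zero columns each, which cannot match five column names. The sketch below checks that and then repeats the approach from this answer. The names `rows`, `n_rows`, `n_cols` and `empty_shape` are illustrative.

```python
import pandas as pd

rows = [[], [], [], [], []]
# Five rows, each holding zero values: the data has 0 columns.
n_rows, n_cols = len(rows), len(rows[0])
print(n_rows, n_cols)  # 5 0

# Passing only the column names creates a frame with 0 rows and 5 columns.
result_frame = pd.DataFrame(columns=["A", "B", "C", "D", "E"])
empty_shape = result_frame.shape
print(empty_shape)  # (0, 5)

# Appending one row by assigning to the next integer label.
result_frame.loc[len(result_frame)] = [1, 2, 3, 4, 5]
print(result_frame.shape)  # (1, 5)
```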
I think it is better to create a list of tuples or lists and then call DataFrame only once:
L = []
for i in range(3):
    # some random data
    a = 1
    b = i + 2
    c = i - b
    d = i
    e = 10
    L.append((a, b, c, d, e))
print (L)
[(1, 2, -2, 0, 10), (1, 3, -2, 1, 10), (1, 4, -2, 2, 10)]
result_table = pd.DataFrame(L, columns = ["A", "B", "C", "D", "E"])
print (result_table)
A B C D E
0 1 2 -2 0 10
1 1 3 -2 1 10
2 1 4 -2 2 10

Why is the index only changing when I use different values?

I just started programming in Python, and I can't figure out how to make the index change if I want the values in the list to be the same. What I want is for the index to change, so it will print 0, 1, 2, but all I get is 0, 0, 0. I tried to change the values of the list so that they were different, and then I got the output I wanted. But I don't understand why it matters what kind of values I use, why would the index care about what is in the list?
a = 0
b = 0
c = 0
d = 0
e = 0
f = 0
justTesting = [[a, b], [c, d], [e, f]]
for item in justTesting:
    something = justTesting.index(item)
    print (something)
I'm using Python 3.6.1, if that matters.
Because each list (designated 'item' in your loop) is [0, 0] this means the line:
something = justTesting.index(item)
will look for the first instance of the list [0, 0] in the list for each 'item' during the iteration. As every item in the list is [0, 0] the first instance is at position 0.
I have prepared an alternative example to illustrate the point
a = 1
b = 2
c = 3
d = 4
e = 5
f = 6
justTesting = [[a, b], [c, d], [e, f]]
for item in justTesting:
    print(item)
    something = justTesting.index(item)
    print(something)
This results in the following:
[1, 2]
0
[3, 4]
1
[5, 6]
2
It's because your list only contains [0, 0]!
So basically, if we replace all the variables with their values, we get:
justTesting = [[0, 0], [0, 0], [0, 0]]
And using .index(item) will return the first occurrence of item if any. Since item is always [0, 0] and it first appears at justTesting[0], you will always get 0! Try changing up the values in each list and try again. For example, this works:
b = [1, 2, 3, 4, 5, 6, 7, 8, 9]
for item in b:
    print(b.index(item))
Which returns:
0, 1, 2, 3, 4, 5, 6, 7, 8
if the results were on a single line.
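Read the documentation: by default, index returns the first occurrence. You need to use the start parameter as well, updating it as you go, so that each search covers only the part of the list after the most recent find (initialize something to -1 before the loop so that the first search starts at index 0).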
something = justTesting.index(item, something+1)
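A runnable sketch of the start-parameter idea, using the list from the question. The names `just_testing`, `position` and `found` are illustrative, and the enumerate variant at the end is an alternative I am adding, not something taken from the answers.

```python
just_testing = [[0, 0], [0, 0], [0, 0]]

found = []
position = -1  # so that the first search starts at index 0
for item in just_testing:
    # Search only the part of the list after the previous match.
    position = just_testing.index(item, position + 1)
    found.append(position)
print(found)  # [0, 1, 2]

# If only the position in the loop is needed, enumerate gives it directly
# without searching the list at all.
positions = [i for i, _ in enumerate(just_testing)]
print(positions)  # [0, 1, 2]
```

The enumerate version also avoids rescanning the list on every iteration, which matters for long lists.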
That's because you are iterating over a list of lists.
Every item is actually a list, and you are executing list.index() method which returns the index of the element in the list.
This is a little tricky. Since you actually have 3 lists, of [0, 0] their values will be the same when testing for equality:
>>> a = 0
>>> b = 0
>>> c = 0
>>> d = 0
>>> ab = [a, b]
>>> cd = [c, d]
>>>
>>> ab is cd
False
>>> ab == cd
True
>>>
Now when you run list.index(obj) you are looking for the 1st index that matches the object. Your code actually runs list.index([0, 0]) 3 times and returns the first match, which is at index 0.
Put different values inside a, b, c lists and it would work as you expect.
Your code:
a = 0
b = 0
c = 0
d = 0
e = 0
f = 0
justTesting = [[a, b], [c, d], [e, f]]
for item in justTesting:
    something = justTesting.index(item)
    print (something)
is equivalent to:
a = 0
b = 0
c = 0
d = 0
e = 0
f = 0
ab = [a, b]
cd = [c, d]
ef = [e, f]
justTesting = [ab, cd, ef]
# Note that ab == cd is True and cd == ef is True
# so all elements of justTesting are identical.
#
# for item in justTesting:
#     something = justTesting.index(item)
#     print (something)
#
# is essentially equivalent to:
item = justTesting[0] # = ab = [0, 0]
something = justTesting.index(item) # = 0 First occurrence of [0, 0] in justTesting
# is **always** at index 0
item = justTesting[1] # = cd = [0, 0]
something = justTesting.index(item) # = 0
item = justTesting[2] # = ef = [0, 0]
something = justTesting.index(item) # = 0
justTesting does not change as you iterate and the first position in justTesting at which [0,0] is found is always 0.
But I don't understand why it matters what kind of values I use, why would the index care about what is in the list?
Possibly what is confusing you is the fact that index() does not search for occurrences of the item "in abstract" but it looks at the values of items in a list and compares those values with a given value of item. That is,
[ab, cd, ef].index(cd)
is equivalent to
[[0,0],[0,0],[0,0]].index([0,0])
and the first occurrence of [0,0] value (!!!) is at 0 index of the list for your specific values for a, b, c, d, e, and f.
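The difference between searching by value and searching by identity can be sketched like this. The names `by_value` and `by_identity` are illustrative. The identity search is only there to contrast with what index() does and is not something the answers recommend.

```python
ab, cd, ef = [0, 0], [0, 0], [0, 0]
just_testing = [ab, cd, ef]

# index() returns the first element that compares equal to cd,
# and ab == cd is True, so the match is at position 0.
by_value = just_testing.index(cd)
print(by_value)  # 0

# Looking for the very same object (identity, not equality) finds cd itself.
by_identity = next(i for i, item in enumerate(just_testing) if item is cd)
print(by_identity)  # 1
```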
