Simultaneously change occurrences in a numpy array - python

I have a numpy array that looks something like this:
h = array([string1 1
           string2 1
           string3 1
           string4 3
           string5 4
           string6 2
           string7 2
           string8 4
           string9 3
           string0 2 ])
In the second column, I would like to change all occurrences of 1 to 3, all occurrences of 3 to 2, and all occurrences of 4 to 1.
Obviously, if I try to do it sequentially in place I will get the wrong result, because:
h[h[:, 1] == 1, 1] = 3
h[h[:, 1] == 3, 1] = 2
will change all the 1's into 2's.
The matrix can be up to 50,000 elements long, and the values to change might vary.
I was looking at a similar question here, but it was turning all digits to 0, and the answers were specific to that.
Is there a way to simultaneously change all these occurrences or am I going to have to find another way?

You can use a lookup table and advanced indexing:
A = np.rec.fromarrays([np.array("The quick brown fox jumps over the lazy dog .".split()), np.array([1,1,1,3,4,2,2,4,3,2])])
A
# rec.array([('The', 1), ('quick', 1), ('brown', 1), ('fox', 3),
# ('jumps', 4), ('over', 2), ('the', 2), ('lazy', 4), ('dog', 3),
# ('.', 2)],
# dtype=[('f0', '<U5'), ('f1', '<i8')])
LU = np.arange(A['f1'].max()+1)  # identity table [0, 1, 2, 3, 4]
LU[[1,3,4]] = 3,2,1              # remap 1->3, 3->2, 4->1
A['f1'] = LU[A['f1']]            # apply every replacement in one pass
A
# rec.array([('The', 3), ('quick', 3), ('brown', 3), ('fox', 2),
# ('jumps', 1), ('over', 2), ('the', 2), ('lazy', 1), ('dog', 2),
# ('.', 2)],
# dtype=[('f0', '<U5'), ('f1', '<i8')])
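The same lookup-table trick works on a plain integer array, which is closer to the question's second column; a minimal sketch, assuming the column has been extracted as an int array:
import numpy as np
col = np.array([1, 1, 1, 3, 4, 2, 2, 4, 3, 2])
lut = np.arange(col.max() + 1)  # identity table [0, 1, 2, 3, 4]
lut[[1, 3, 4]] = 3, 2, 1        # 1->3, 3->2, 4->1
col = lut[col]                  # every replacement happens in one pass
# array([3, 3, 3, 2, 1, 2, 2, 1, 2, 2])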

You can either use map directly, or use the more efficient numpy.vectorize to turn a mapping function into a function that can be applied to the array directly:
import numpy as np
mapping = {
    1: 3,
    3: 4,
    4: 1
}
a = np.array([1, 2, 3, 4, 5, 1, 2, 3, 4, 5])
mapping_func = np.vectorize(lambda x: mapping[x] if x in mapping else x)
b = mapping_func(a)
print(a)
print(b)
Result:
[1 2 3 4 5 1 2 3 4 5]
[3 2 4 1 5 3 2 4 1 5]
Note that you don't have to use a dict or a lambda function. Your function could be any normal function that takes the data type of your source array as input (int in this case) and returns the data type of the target array.
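For instance, a short sketch with a named function in place of the lambda (remap is an illustrative name, not from the original answer):
import numpy as np

def remap(x):
    """Return the mapped value, or x unchanged if it has no mapping."""
    return {1: 3, 3: 4, 4: 1}.get(x, x)

mapping_func = np.vectorize(remap)
mapping_func(np.array([1, 2, 3, 4, 5]))  # array([3, 2, 4, 1, 5])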

The best way to do it is to use a dict to map the values. Doing so requires a vectorized function:
import numpy as np
a = np.array([[1,1],[1,2],[1,3]])
>>> a
array([[1, 1],
       [1, 2],
       [1, 3]])
dic = {3:2,2:3}
vfunc = np.vectorize(lambda x:dic[x] if x in dic else x)
a[:,1] = vfunc(a[:,1])
>>> a
array([[1, 1],
       [1, 3],
       [1, 2]])
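For the original three-way swap, plain boolean masks also work, provided every mask is computed before any assignment so that each mask still refers to the original values; a minimal sketch, assuming an int column:
import numpy as np
col = np.array([1, 1, 1, 3, 4, 2, 2, 4, 3, 2])
m1, m3, m4 = col == 1, col == 3, col == 4  # masks built up front
col[m1], col[m3], col[m4] = 3, 2, 1        # 1->3, 3->2, 4->1, no clobbering
# array([3, 3, 3, 2, 1, 2, 2, 1, 2, 2])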

Related

python pandas pulling two values out of the same column

What I have is a basic dataframe that I want to pull two values out of, based on index position. So for this:
first_column  second_column
1             1
2             2
3             3
4             4
5             5
I want to extract the values in row 1 and row 2 (1 2) out of first_column, then extract the values in row 2 and row 3 (2 3) out of first_column, and so on until I've iterated over the entire column. I ran into an issue with the for loop and am stuck on getting the next index value.
I have code like below:
import pandas as pd
data = {'first_column': [1, 2, 3, 4, 5],
        'second_column': [1, 2, 3, 4, 5],
        }
df = pd.DataFrame(data)
for index, row in df.iterrows():
    print(index, row['first_column'])  # value1
    print(index + 1, row['first_column'].values(index + 1))  # value2 <-- error in logic here
Ignoring the prints, which will eventually become variables that are returned, how can I improve this to return (1 2), (2 3), (3 4), (4 5), etc.?
Also, is this easier done with the iteritems() method instead of iterrows()?
Not sure if this is what you want to achieve:
temp = (df.assign(second_column=df.second_column.shift(-1))
          .dropna()
          .assign(second_column=lambda df: df.second_column.astype(int))
       )
[*zip(temp.first_column.array, temp.second_column.array)]
[(1, 2), (2, 3), (3, 4), (4, 5)]
A simpler solution from @HenryEcker:
list(zip(df['first_column'], df['first_column'].iloc[1:]))
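On Python 3.10+, itertools.pairwise expresses the same idea directly; a sketch assuming the df above:
from itertools import pairwise
list(pairwise(df['first_column']))
# [(1, 2), (2, 3), (3, 4), (4, 5)]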
I don't know if this answers your question, but maybe you can try this:
for i, val in enumerate(df['first_column']):
    if val + 1 > 5:
        break
    else:
        print(val, ", ", val + 1)
If you want to take these items in the same fashion, you should consider using iloc instead of iterrows.
out = []
for i in range(len(df) - 1):
    print(i, df.iloc[i]["first_column"])
    print(i + 1, df.iloc[i + 1]["first_column"])
    out.append((df.iloc[i]["first_column"],
                df.iloc[i + 1]["first_column"]))
print(out)
[(1, 2), (2, 3), (3, 4), (4, 5)]

Numpy/Pandas: Merge two numpy arrays based on one array efficiently

I have two numpy arrays comprised of two-set tuples:
a = [(1, "alpha"), (2, 3), ...]
b = [(1, "zylo"), (1, "xen"), (2, "potato", ...]
The first element in the tuple is the identifier and shared between both arrays, so I want to create a new numpy array which looks like this:
[(1, "alpha", "zylo", "xen"), (2, 3, "potato"), etc...]
My current solution works, but it's way too inefficient for me. Looks like this:
aggregate_collection = []
for tuple_set in a:
    for tuple_set2 in b:
        if tuple_set[0] == tuple_set2[0] and other_condition:
            temp_tup = (tuple_set[0], other tuple values)
            aggregate_collection.append(temp_tup)
How can I do this efficiently?
I'd concatenate these into a data frame and just groupby+agg
(pd.concat([pd.DataFrame(a), pd.DataFrame(b)])
   .groupby(0)
   .agg(lambda s: [s.name, *s])[1])
where 0 and 1 are the default column names given when creating a dataframe via pd.DataFrame. Change them to your column names.
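A sketch of the same approach with explicit column names ("key" and "value" are illustrative, not from the question):
import pandas as pd
a = [(1, "alpha"), (2, 3)]
b = [(1, "zylo"), (1, "xen"), (2, "potato")]
df = pd.concat([pd.DataFrame(a, columns=["key", "value"]),
                pd.DataFrame(b, columns=["key", "value"])])
grouped = df.groupby("key")["value"].agg(list)  # key -> list of its values
[(k, *v) for k, v in grouped.items()]
# [(1, 'alpha', 'zylo', 'xen'), (2, 3, 'potato')]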
In [278]: a = [(1, "alpha"), (2, 3)]
...: b = [(1, "zylo"), (1, "xen"), (2, "potato")]
In [279]: a
Out[279]: [(1, 'alpha'), (2, 3)]
In [280]: b
Out[280]: [(1, 'zylo'), (1, 'xen'), (2, 'potato')]
Note that if I try to make an array from a I get something quite different.
In [281]: np.array(a)
Out[281]:
array([['1', 'alpha'],
['2', '3']], dtype='<U21')
In [282]: _.shape
Out[282]: (2, 2)
defaultdict is a handy tool for collecting like-keyed values
In [283]: from collections import defaultdict
In [284]: dd = defaultdict(list)
In [285]: for tup in a+b:
     ...:     k, v = tup
     ...:     dd[k].append(v)
     ...:
In [286]: dd
Out[286]: defaultdict(list, {1: ['alpha', 'zylo', 'xen'], 2: [3, 'potato']})
which can be cast as a list of tuples with:
In [288]: [(k,*v) for k,v in dd.items()]
Out[288]: [(1, 'alpha', 'zylo', 'xen'), (2, 3, 'potato')]
I'm using a+b to join the lists, since it apparently doesn't matter where the tuples occur.
Out[288] is even a poor numpy fit, since the tuples differ in size, and items (other than the first) might be strings or numbers.

How to random assign 0 or 1 for x rows depend upon y column value in excel

I'm trying to generate the sample data below in Excel. There are 3 columns, and I want output like the IsShade column. I've tried =RANDARRAY(20,1,0,1,TRUE), but it's not working exactly.
Within each block of NoOfCells rows, I want to place a random '1' in exactly as many rows as the Shading value.
NoOfCells  Shading  IsShade(o/p)
5          2        0
5          2        0
5          2        1
5          2        0
5          2        1
--------------------
4          3        1
4          3        1
4          3        0
4          3        1
--------------------
4          1        0
4          1        0
4          1        0
4          1        1
I'd appreciate it if anyone could help me out. Python code will also work, since I will read the Excel file as CSV and try to generate the IsShade output column. Thank you!!
A small snippet of Python that writes your CSV file. The code uses only the standard library, not Pandas or NumPy, to keep things simple if you want to use Python with Excel.
import random
import itertools
import csv

cols = ['NoOfCells', 'Shading', 'IsShade(o/p)']
data = [(5, 2), (4, 3), (4, 1)]  # (c, s)

lst = []
for c, s in data:  # c=5, s=2
    l = [0]*(c-s) + [1]*s  # 3x[0], 2x[1] -> [0, 0, 0, 1, 1]
    random.shuffle(l)      # shuffle -> [1, 0, 0, 0, 1]
    lst.append(zip([c]*c, [s]*c, l))

# flatten the list
lst = list(itertools.chain(*lst))

with open('shade.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile, delimiter=',')
    writer.writerow(cols)
    writer.writerows(lst)
>>> lst
[(5, 2, 1),
(5, 2, 0),
(5, 2, 0),
(5, 2, 0),
(5, 2, 1),
(4, 3, 1),
(4, 3, 0),
(4, 3, 1),
(4, 3, 1),
(4, 1, 0),
(4, 1, 0),
(4, 1, 1),
(4, 1, 0)]
$ cat shade.csv
NoOfCells,Shading,IsShade(o/p)
5,2,0
5,2,0
5,2,1
5,2,0
5,2,1
4,3,1
4,3,1
4,3,1
4,3,0
4,1,0
4,1,1
4,1,0
4,1,0
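If NumPy is available, the per-block shuffle can be written more compactly; a sketch assuming the same (NoOfCells, Shading) blocks as above:
import numpy as np
blocks = [(5, 2), (4, 3), (4, 1)]  # (NoOfCells, Shading) per block
is_shade = np.concatenate([
    np.random.permutation([1] * s + [0] * (c - s))  # s ones among c rows
    for c, s in blocks
])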
You can count the number of rows for RANDARRAY to return using COUNTA. Also, to exclude the dividing lines, test with ISNUMBER:
=LET(Data,FILTER(B:B,(B:B<>"")*(ROW(B:B)>1)),IF(ISNUMBER(Data),RANDARRAY(COUNTA(Data),1,0,1,TRUE),""))

Finding the min or max sum of a row in an array

How can I quickly find the min or max sum of the elements of a row in an array?
For example:
1, 2
3, 4
5, 6
7, 8
The minimum sum would be row 0 (1 + 2), and the maximum sum would be row 3 (7 + 8)
print mat.shape
(8, 1, 2)
print mat
[[[-995.40045  -409.15112]]
 [[-989.1511   3365.3267 ]]
 [[-989.1511   3365.3267 ]]
 [[1674.5447   3035.3523 ]]
 [[   0.          0.     ]]
 [[   0.       3199.     ]]
 [[   0.       3199.     ]]
 [[2367.       3199.     ]]]
In native Python, min and max have key functions:
>>> LoT=[(1, 2), (3, 4), (5, 6), (7, 8)]
>>> min(LoT, key=sum)
(1, 2)
>>> max(LoT, key=sum)
(7, 8)
If you want the index of the first min or max in Python, you would do something like:
>>> min(enumerate(LoT), key=lambda pair: sum(pair[1]))
(0, (1, 2))
And then peel that tuple apart to get what you want. You also could use that in numpy, but at unknown (to me) performance cost.
In numpy, you can do:
>>> a=np.array(LoT)
>>> a[a.sum(axis=1).argmin()]
array([1, 2])
>>> a[a.sum(axis=1).argmax()]
array([7, 8])
To get the index only:
>>> a.sum(axis=1).argmax()
3
x = np.sum(x,axis=1)
min_x = x.min()
max_x = x.max()
Presuming x is a (4, 2) array, np.sum sums across the rows; .min() then returns the minimum of those row sums and .max() returns the maximum.
You can do this using np.argmin and np.sum:
array_minimum_index = np.argmin([np.sum(x, axis=1) for x in mat])
array_maximum_index = np.argmax([np.sum(x, axis=1) for x in mat])
For your array, this results in array_minimum_index = 0 and array_maximum_index = 7, as your sums at those indices are -1404.55157 and 5566.0
To simply print out the values of the min and max sum, you can do this:
array_sum_min = min([np.sum(x,axis=1) for x in mat])
array_sum_max = max([np.sum(x,axis=1) for x in mat])
You can use min and max and use sum as their key.
lst = [(1, 2), (3, 4), (5, 6), (7, 8)]
min(lst, key=sum) # (1, 2)
max(lst, key=sum) # (7, 8)
If you want the sum directly and you do not care about the tuple itself, then map can be of help.
min(map(sum, lst)) # 3
max(map(sum, lst)) # 15
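Applied to the (8, 1, 2) array shown in the question, summing over the last two axes gives one sum per row; a minimal sketch, assuming mat is that array:
import numpy as np
sums = mat.sum(axis=(1, 2))   # shape (8,), one sum per row
mat[sums.argmin()]            # row with the smallest sum, -1404.55157
mat[sums.argmax()]            # row with the largest sum, 5566.0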

Extract array (column name, data) from Pandas DataFrame

This is my first question at Stack Overflow.
I have a DataFrame of Pandas like this.
       a  b  c  d
one    0  1  2  3
two    4  5  6  7
three  8  9  0  1
four   2  1  1  5
five   1  1  8  9
I want to extract the (column name, data) pairs whose data is 1, with the pairs for each index in a separate array:
[ [(b,1.0)], [(d,1.0)], [(b,1.0),(c,1.0)], [(a,1.0),(b,1.0)] ]
I want to use the gensim Python library, which requires a corpus in this form.
Is there any smart way to do this or to apply gensim from pandas data?
Many gensim functions accept numpy arrays, so there may be a better way...
In [11]: is_one = np.where(df == 1)
In [12]: is_one
Out[12]: (array([0, 2, 3, 3, 4, 4]), array([1, 3, 1, 2, 0, 1]))
In [13]: df.index[is_one[0]], df.columns[is_one[1]]
Out[13]:
(Index([u'one', u'three', u'four', u'four', u'five', u'five'], dtype='object'),
Index([u'b', u'd', u'b', u'c', u'a', u'b'], dtype='object'))
To groupby each row, you could use iterrows:
from itertools import repeat
In [21]: [list(zip(df.columns[np.where(row == 1)], repeat(1.0)))
    ...:  for label, row in df.iterrows()
    ...:  if 1 in row.values]  # if you don't want empty [] for rows without 1
Out[21]:
[[('b', 1.0)],
[('d', 1.0)],
[('b', 1.0), ('c', 1.0)],
[('a', 1.0), ('b', 1.0)]]
In Python 2 the list call is not required, since zip returns a list.
Another way would be
In [1652]: [[(c, 1) for c in x[x].index] for _, x in df.eq(1).iterrows() if x.any()]
Out[1652]: [[('b', 1)], [('d', 1)], [('b', 1), ('c', 1)], [('a', 1), ('b', 1)]]
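Another variant uses stack, which flattens the frame into a Series indexed by (row label, column label); the 1-valued cells can then be grouped by row. A sketch assuming the DataFrame from the question:
import pandas as pd
df = pd.DataFrame([[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 0, 1],
                   [2, 1, 1, 5], [1, 1, 8, 9]],
                  index=['one', 'two', 'three', 'four', 'five'],
                  columns=list('abcd'))
ones = df.stack()         # Series indexed by (row, column)
ones = ones[ones == 1]    # keep only the cells equal to 1
[[(col, 1.0) for col in grp.index.get_level_values(1)]
 for _, grp in ones.groupby(level=0, sort=False)]
# [[('b', 1.0)], [('d', 1.0)], [('b', 1.0), ('c', 1.0)], [('a', 1.0), ('b', 1.0)]]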
