np.maximum for scalar and pandas Series without np.nan

I have a list of pd.Series and scalar values (float and int) for which I'd like to find the element-wise maximum (the Series all have the same length). If there is a np.nan value, another value should be used (np.nan only if nothing but nans are available). The approach below works fine as long as the Series or values in the list don't contain nan values, but if they do, the nans dominate the resulting series.
rv = input_list[0]
for s in input_list[1:]:
    rv = np.maximum(s, rv)
As an example
input_list = [pd.Series([1, 2, 3, 1]), 2, pd.Series([3, 1, np.nan, 4])]
should return:
pd.Series([3, 2, 3, 4])
How can I modify this code to take care of nan values and ignore them if there are alternative values?

Solution using numpy.nanmax
You are looking for numpy.nanmax. From its documentation:
Return the maximum of an array or maximum along an axis, ignoring any
NaNs. When all-NaN slices are encountered a RuntimeWarning is raised
and NaN is returned for that slice.
So if you know that the common length of the Series is n:
n = 4
result = pd.Series(np.nanmax(
    [np.full(n, i) if np.isscalar(i) else i for i in input_list], axis=0))
Running it on the example:
input_list = [pd.Series([1, 2, 3, 1]), 2, pd.Series([3, 1, np.nan, 4])]
result = pd.Series(np.nanmax(
    [np.full(n, i) if np.isscalar(i) else i for i in input_list], axis=0))
Output:
0 3.0
1 2.0
2 3.0
3 4.0
dtype: float64
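If you'd rather not hardcode n, it can be derived from the list itself (a small sketch, assuming input_list contains at least one Series):
# broadcast length taken from the Series entries; all Series share the same length
n = max(len(s) for s in input_list if not np.isscalar(s))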

Related

Find index of max element in numpy array excluding few indexes

Say:
p = np.array([4, 0, 8, 2, 7])
I want to find the index of the max value, excluding a few indexes, say:
excptIndx = [2, 3]
Answer: 4, as 7 will be the max.
If excptIndx = [1, 3], the answer is 2, as 8 will be the max.
In numpy, you can mask all values at excptIndx and run argmax to obtain index of max element:
import numpy as np
p = np.array([4, 0, 8, 2, 7])
excptIndx = [2, 3]
m = np.zeros(p.size, dtype=bool)
m[excptIndx] = True
a = np.ma.array(p, mask=m)
print(np.argmax(a))
# 4
The setup:
In [153]: p = np.array([4,0,8,2,7])
In [154]: exceptions = [2,3]
Original indexes in p:
In [155]: idx = np.arange(p.shape[0])
delete exceptions from both:
In [156]: np.delete(p,exceptions)
Out[156]: array([4, 0, 7])
In [157]: np.delete(idx,exceptions)
Out[157]: array([0, 1, 4])
Find the argmax in the deleted array:
In [158]: np.argmax(np.delete(p,exceptions))
Out[158]: 2
Use that to find the max value (could just as well use np.max(_156)):
In [159]: _156[_158]
Out[159]: 7
Use the same index to find the index in the original p
In [160]: _157[_158]
Out[160]: 4
In [161]: p[_160] # another way to get the max value
Out[161]: 7
For this small example, the pure Python alternatives might well be faster. They often are in small cases. We need test cases with 1000 or more values to really see the advantages of numpy.
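For comparison, a pure Python version might look like this (a sketch, not from the original answer):
p = [4, 0, 8, 2, 7]
exceptions = {2, 3}
# scan the positions, skipping the excluded indexes
best_idx = max((i for i in range(len(p)) if i not in exceptions), key=lambda i: p[i])
# best_idx == 4, p[best_idx] == 7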
Another method
Set the exceptions to a small enough value, and take the argmax:
In [162]: p1 = p.copy(); p1[exceptions] = -1000
In [163]: np.argmax(p1)
Out[163]: 4
Here a small-enough value is easy to pick; more generally it may require some thought.
Or taking advantage of the np.nan... functions:
In [164]: p1 = p.astype(float); p1[exceptions]=np.nan
In [165]: np.nanargmax(p1)
Out[165]: 4
A solution is:
mask = ~np.isin(np.arange(len(p)), excptIndx)   # True for the indexes we keep
subset_idx = np.argmax(p[mask])
parent_idx = np.arange(len(p))[mask][subset_idx]
See http://seanlaw.github.io/2015/09/10/numpy-argmin-with-a-condition/
p = np.array([4, 0, 8, 2, 7])               # given
exceptions = [2, 3]                         # given
idx = list(range(len(p)))                   # simple list of indexes
a1 = np.delete(idx, exceptions)             # remove exceptions from idx (i.e., the indexes)
a2 = np.argmax(np.delete(p, exceptions))    # index of the max value after removing exceptions from p
a1[a2]  # a1 and a2 are in sync, so this gives the original index (as asked) of the max value

Selective deletion by value in numpy array

EDITED: Refined problem statement
I am still figuring out the fancy options offered by the numpy library. The following topic landed on my desk:
Purpose:
In a multi-dimensional array I select one column. This slicing works fine. But after that, values stored in another list need to be filtered out of the column values.
Current status:
array1 = np.asarray([[0,1,2],[1,0,3],[2,3,0]])
print(array1)
array1woZero = np.nonzero(array1)
print(array1woZero)
toBeRemoved = []
toBeRemoved.append(1)
print(toBeRemoved)
column = array1[:,1]
result = np.delete(column,toBeRemoved)
The above code does not produce the expected result. In fact, the np.delete() command just removes the value at index 1, but I need the value 1 itself to be filtered out instead. What I also do not understand is the shape change when applying nonzero to array1: while array1 is (3,3), array1woZero turns out to be a tuple of 2 arrays with 6 values each.
(array([0, 0, 1, 1, 2, 2]), array([1, 2, 0, 2, 0, 1]))  # two int64 arrays of shape (6,)
My feeling is that I would require something like slicing with an exclusion operator. Do you have any hints for me to solve that? Is it necessary to use different data structures?
In [18]: arr = np.asarray([[0,1,2],[1,0,3],[2,3,0]])
In [19]: arr
Out[19]:
array([[0, 1, 2],
[1, 0, 3],
[2, 3, 0]])
nonzero gives the indices of all non-zero elements of its argument (arr):
In [20]: idx = np.nonzero(arr)
In [21]: idx
Out[21]: (array([0, 0, 1, 1, 2, 2]), array([1, 2, 0, 2, 0, 1]))
This is a tuple of arrays, one per dimension. That output can be confusing, but it is easily used to return all of those non-zero elements:
In [22]: arr[idx]
Out[22]: array([1, 2, 1, 3, 2, 3])
Indexing like this, with a pair of arrays, produces a 1d array. In your example there is just one 0 per row, but in general that's not the case.
This is the same indexing - with 2 lists of the same length:
In [24]: arr[[0,0,1,1,2,2], [1,2,0,2,0,1]]
Out[24]: array([1, 2, 1, 3, 2, 3])
idx[0] just selects one array of that tuple, the row indices. That probably isn't what you want. And I doubt if you want to apply np.delete to that tuple.
It's hard to tell from the description, and code, what you want. Maybe that's because you don't understand what nonzero is producing.
We can also select the nonzero elements with boolean masking:
In [25]: arr>0
Out[25]:
array([[False, True, True],
[ True, False, True],
[ True, True, False]])
In [26]: arr[ arr>0 ]
Out[26]: array([1, 2, 1, 3, 2, 3])
The hint with the boolean masking was very good and helped me to develop my own solution. The symbolic names in the following code snippets are different, but the idea should become clear anyway.
At the beginning, I have my overall searchSpace.
searchSpace = relativeDistances[currentNode,:]
Assume that its shape is (5,). My filter is defined on the indexes, i.e. the range 0..4. Then I define another numpy array "filter" of the same shape, filled with 1, and set the values to be filtered out to 0.
filter = np.full(shape=nodeCount, fill_value=1, dtype=np.int32)
filter[0] = 0
filter[3] = 0
searchSpace = searchSpace * filter
minValue = searchSpace[searchSpace > 0].min()
neighborNode = np.where(searchSpace == minValue)
The filter array provides me the flexibility to adjust the filter later on as part of a loop. Using the element-wise multiplication with 0 and subsequent boolean masking, I can create my reduced searchSpace for minimum search. Compared to a separate array or list, I still have the original shape, which is required to get the correct index in the where-statement.
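For the originally stated goal of filtering by value rather than by position, a boolean mask built with np.isin is another option; a minimal sketch reusing the question's variable names:
import numpy as np

array1 = np.asarray([[0, 1, 2], [1, 0, 3], [2, 3, 0]])
toBeRemoved = [1]

column = array1[:, 1]                   # the selected column: [1, 0, 3]
keep = ~np.isin(column, toBeRemoved)    # True where the value is NOT in toBeRemoved
result = column[keep]                   # [0, 3]
original_rows = np.nonzero(keep)[0]     # row indices of the kept values: [1, 2]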

Populate Pandas Series with list

I would like to populate a pd.Series() with a list.
I tried doing the following:
series = pd.Series(index=['a','b','c','d'])
series['a'] = 2
series['b'] = [2,3]
This is the error that I get. How can I put the list into the pd.Series?
File "C:\Users\Sergej Shteriev\Anaconda3\lib\site-packages\pandas\core\internals.py", line 940, in setitem
values[indexer] = value
ValueError: setting an array element with a sequence.
This is because the initial dtype is assumed to be float (as the series is filled with NaNs).
series.dtype
# dtype('float64')
Since lists are only supported by object type columns, you'd need to cast before assigning.
series = series.astype(object)
series['b'] = [2, 3]
series
a 2 # this is still a float
b [2, 3]
c NaN
d NaN
dtype: object
series.tolist()
# [2.0, [2, 3], nan, nan]
A better suggestion is to declare series as an object at the start if that's what you intend stuffing into it.
series = pd.Series(index=['a','b','c','d'], dtype=object)
series['a'] = 2
series['b'] = [2, 3]
series
a 2
b [2, 3]
c NaN
d NaN
dtype: object
series.tolist()
# [2, [2, 3], nan, nan]
Of course, for performance reasons, I don't condone this. You're better off using python lists -- they're usually faster than object Series.

pandas.Series method that returns updated series

Is there a Series method that acts like update but returns the updated series instead of updating in place?
Put another way, is there a better way to do this:
# Original series
ser1 = Series([10, 9, 8, 7], index=[1,2,3,4])
# I want to change ser1 to be [10, 1, 2, 7]
adj_ser = Series([1, 2], index=[2,3])
adjusted = my_method(ser1, adj_ser)
# Is there a builtin that does this already?
def my_method(current_series, adjustments):
    x = current_series.copy()
    x.update(adjustments)
    return x
One possible solution is combine_first, but note that it updates adj_ser with ser1 (not the other way around), and it also casts integers to floats:
adjusted = adj_ser.combine_first(ser1)
print (adjusted)
1 10.0
2 1.0
3 2.0
4 7.0
dtype: float64
#nixon is right that iloc and loc are good for this kind of thing
import pandas as pd
# Original series
ser1 = pd.Series([10, 9, 8, 7], index=[1,2,3,4])
ser2 = ser1.copy()
ser3 = ser1.copy()
# I want to change ser1 to be [10, 1, 2, 7]
# One way
ser2.iloc[1:3] = [1,2]
ser2 # [10, 1, 2, 7]
# Another way
ser3.loc[[2, 3]] = [1, 2]
ser3 # [10, 1, 2, 7]
Why two different methods?
As this post explains quite well, the major difference between loc and iloc is labels vs. position. My personal shorthand: if you're trying to make adjustments based on the zero-indexed position of a value, use iloc; otherwise use loc. YMMV
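A tiny illustration of that difference on a series with a non-default index (a sketch):
s = pd.Series([10, 9, 8, 7], index=[1, 2, 3, 4])
s.loc[2]    # 9  -> label-based: the element whose index label is 2
s.iloc[2]   # 8  -> position-based: the third element (zero-based position 2)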
No built-in function other than update, but you can use mask with a Boolean series:
def my_method(current_series, adjustments):
    bools = current_series.index.isin(adjustments.index)
    return current_series.mask(bools, adjustments)
However, as the masking process introduces intermediary NaN values, your series will be upcast to float. So your update solution is best.
Here is another way:
adjusted = ser1.mask(ser1.index.isin(adj_ser.index), adj_ser)
adjusted
Output:
1 10
2 1
3 2
4 7
dtype: int64

assigning an alternative value to pandas dataFrame conditional on its value

I am trying to assign alternative values to a column in a pandas dataFrame object. The condition to assigning an alternative value is that the element has value zero now.
This is my code snippet:
df = pd.DataFrame({'A': [0, 1, 2, 0, 0, 1, 1, 0], 'B': [1, 2, 3, 4, 1, 2, 3, 4]})
for i, row in df.iterrows():
    if row['A'] == 0.0:
        df.iloc[i]['A'] = df.iloc[i-1]['A'] + df.iloc[i]['B'] - df.iloc[i-1]['B']
However, as it turns out, the values in these elements remain zero! The above has zero effect.
What's going on?
The original answer below works for some inputs, but it's not entirely right. Testing your code with the dataframe in your question, I found that it works, but it's not guaranteed to work with all dataframes. Here's an example where it doesn't work:
df = pd.DataFrame(np.random.randn(6,4), index=list(range(0,12,2)), columns=['A', 'B', 'C', 'D'])
This dataframe will cause your code to fail because the indices are not 0, 1, 2... as your algorithm expects, they're 0, 2, 4, ..., as defined by index=list(range(0,12,2)).
That means the values of i returned by the iterator will also be 0, 2, 4,..., so you'll get unexpected results when you try to use i-1 as a parameter to iloc.
In short, when you use for i, row in df.iterrows(): to iterate over a dataframe, i takes on the index values of the dimension you're iterating over as they're defined in the dataframe. Make sure you know what those values are when using them with offsets inside the loop.
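One way to make such offsets independent of the index labels is to keep a separate positional counter (a sketch, not from the original answer):
# enumerate() supplies a 0, 1, 2, ... position alongside iterrows(),
# so pos-1 works even when the index is 0, 2, 4, ...
for pos, (i, row) in enumerate(df.iterrows()):
    if pos > 0 and row['A'] == 0.0:
        df.iloc[pos, df.columns.get_loc('A')] = (
            df.iloc[pos-1]['A'] + df.iloc[pos]['B'] - df.iloc[pos-1]['B']
        )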
Original answer:
I can't figure out why your code doesn't work, but I can verify that it doesn't. It may have something to do with modifying a dataframe while iterating over it, since you can use df.iloc[1]['A'] = 0.0 to set a value outside a loop with no problems.
Try using DataFrame.at instead:
for i, row in df.iterrows():
    if row['A'] == 0.0:
        df.at[i, 'A'] = df.iloc[i-1]['A'] + df.iloc[i]['B'] - df.iloc[i-1]['B']
This doesn't do anything to account for df.iloc[i-1] returning the last row in the dataframe, so be aware of that when the first value in column A is 0.0.
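If wrapping around to the last row is unwanted, a simple guard (a sketch, assuming a default 0, 1, 2, ... index) is to skip the first row:
for i, row in df.iterrows():
    if i == 0:
        continue  # no previous row to reference
    if row['A'] == 0.0:
        df.at[i, 'A'] = df.iloc[i-1]['A'] + df.iloc[i]['B'] - df.iloc[i-1]['B']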
What about:
df = pd.DataFrame({'A': [0, 1, 2, 0, 0, 1, 1, 0], 'B': [1, 2, 3, 4, 1, 2, 3, 4]})
df['A'] = df.where(df[['A']] != 0,
                   df['A'].shift() + df['B'] - df['B'].shift(),
                   axis=0)['A']
print(df)
     A  B
0  NaN  1
1  1.0  2
2  2.0  3
3  3.0  4
4 -3.0  1
5  1.0  2
6  1.0  3
7  2.0  4
The NaN is there because there is no element prior to the first one.
You are using chained indexing, which is related to the famous SettingWithCopy warning. Check the SettingWithCopy section in Modern Pandas by Tom Augspurger.
In general this means that assignments of the form df['A']['B'] = ... are discouraged. It doesn't matter if you use a loc accessor there.
If you add print statements to your code:
for i, row in df.iterrows():
    print(df)
    if row['A'] == 0.0:
        df.iloc[i]['A'] = df.iloc[i-1]['A'] + df.iloc[i]['B'] - df.iloc[i-1]['B']
you see strange things happening. The dataframe df is modified if and only if the first row of column 'A' is 0.
As Bill the Lizard pointed out, you need a single accessor. However, note that Bill's method has the disadvantage of providing label-based access. This may not be what you want for a dataframe that is indexed differently. Then a better solution would be to use loc
for i, row in df.iterrows():
    if row['A'] == 0.0:
        df.loc[df.index[i], 'A'] = df.iloc[i-1]['A'] + df.iloc[i]['B'] - df.iloc[i-1]['B']
or iloc
for i, row in df.iterrows():
    if row['A'] == 0.0:
        df.iloc[i, df.columns.get_loc('A')] = df.iloc[i-1]['A'] + df.iloc[i]['B'] - df.iloc[i-1]['B']
assuming the index is unique in the last case.
Note that the chained indexing occurs when setting values.
Though this approach works, it's - by the quote above - not encouraged!
