Check whether non-index column sorted in Pandas - python

Is there a way to test whether a dataframe is sorted by a given column that's not an index (i.e. is there an equivalent to is_monotonic() for non-index columns) without calling a sort all over again, and without converting a column into an index?

Meanwhile, since 0.19.0, there is pandas.Series.is_monotonic_increasing, pandas.Series.is_monotonic_decreasing, and pandas.Series.is_monotonic.
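For example, on a small illustrative Series (a minimal sketch; the values are just a toy example):
import pandas as pd

s = pd.Series([1, 2, 2, 3])

s.is_monotonic_increasing   # True  (these are properties, not methods; ties are allowed)
s.is_monotonic_decreasing   # False
# Series.is_monotonic is an alias for is_monotonic_increasing; newer pandas versions
# prefer the explicit names.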

There are a handful of functions in pd.algos which might be of use. They're all undocumented implementation details, so they might change from release to release:
>>> pd.algos.is[TAB]
pd.algos.is_lexsorted           pd.algos.is_monotonic_int32
pd.algos.is_monotonic_bool      pd.algos.is_monotonic_int64
pd.algos.is_monotonic_float32   pd.algos.is_monotonic_object
pd.algos.is_monotonic_float64
The is_monotonic_* functions take an array of the specified dtype and a "timelike" boolean that should be False for most use cases. (Pandas sets it to True for a case involving times represented as integers.) The return value is a tuple whose first element represents whether the array is monotonically non-decreasing, and whose second element represents whether the array is monotonically non-increasing. Other tuple elements are version-dependent:
>>> df = pd.DataFrame({"A": [1,2,2], "B": [2,3,1]})
>>> pd.algos.is_monotonic_int64(df.A.values, False)[0]
True
>>> pd.algos.is_monotonic_int64(df.B.values, False)[0]
False
All these functions assume a specific input dtype, even is_lexsorted, which assumes the input is a list of int64 arrays. Pass it the wrong dtype, and it gets really confused:
In [32]: pandas.algos.is_lexsorted([np.array([-2, -1], dtype=np.int64)])
Out[32]: True
In [33]: pandas.algos.is_lexsorted([np.array([-2, -1], dtype=float)])
Out[33]: False
In [34]: pandas.algos.is_lexsorted([np.array([-1, -2, 0], dtype=float)])
Out[34]: True
I'm not entirely sure why Series don't already have some kind of short-circuiting is_sorted. There might be something which makes it trickier than it seems.

You can use a NumPy-based approach:
import numpy as np

def is_df_sorted(df, colname):
    # Note: "> 0" requires strictly increasing values; use ">= 0" to also allow ties.
    return (np.diff(df[colname]) > 0).all()
A more direct approach (like you suggested, but you said you don't want it) is to convert the column to an Index and use its is_monotonic property:
import pandas as pd

def is_df_sorted(df, colname):
    # is_monotonic is an alias for is_monotonic_increasing (ties allowed);
    # newer pandas prefers the explicit is_monotonic_increasing name.
    return pd.Index(df[colname]).is_monotonic
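As a quick check of how the two definitions behave on a toy frame (whichever one is defined last wins the name, and they treat repeated values differently):
df = pd.DataFrame({"A": [1, 2, 2], "B": [2, 3, 1]})

is_df_sorted(df, "A")   # False with the diff version (strict), True with the Index version (ties allowed)
is_df_sorted(df, "B")   # False with either version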

Related

Are the outcomes of the numpy.where method on a pandas dataframe calculated on the full array or the filtered array?

I want to use np.where on a pandas dataframe to check for the existence of a certain string in a column. If the string is present, apply a split function and take the second list element; if not, just take the first character. However, the following code doesn't work: it throws an IndexError: list index out of range because the first entry contains no underscore:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A':['a','a_1','b_','b_2_3']})
df["B"] = np.where(df.A.str.contains('_'),df.A.apply(lambda x: x.split('_')[1]),df.A.str[0])
Calling np.where with only the condition returns an array of indices for which the condition holds true, so I was under the impression that the split command would only be applied to that subset of the data:
np.where(df.A.str.contains('_'))
Out[14]: (array([1, 2, 3], dtype=int64),)
But apparently the split command is applied to the entire unfiltered array, which seems odd to me: that is potentially a large number of unnecessary operations that would slow down the calculation.
I'm not asking for an alternative solution, coming up with that isn't hard.
I'm merely wondering if this is an expected outcome or an issue with either pandas or numpy.
Python isn't a "lazy" language, so code is evaluated immediately. Generators/iterators do introduce some laziness, but that doesn't apply here.
If we split your line of code, we get the following statements:
df.A.str.contains('_')
df.A.apply(lambda x: x.split('_')[1])
df.A.str[0]
Python has to evaluate these statements before it can pass them as arguments to np.where.
To see all this happening, we can rewrite the above as little functions that display some output:
def fn_contains(x):
    print('contains', x)
    return '_' in x

def fn_split(x):
    s = x.split('_')
    print('split', x, s)
    # check for errors here
    if len(s) > 1:
        return s[1]

def fn_first(x):
    print('first', x)
    return x[0]
Then you can run them on your data with:
s = pd.Series(['a','a_1','b_','b_2_3'])
np.where(
    s.apply(fn_contains),
    s.apply(fn_split),
    s.apply(fn_first)
)
and you'll see everything being executed in order. This is basically what's happening "inside" numpy/pandas when you execute that line.
In my opinion, numpy.where only selects values by condition, so the second and third arrays are evaluated for all the data, both filtered and unfiltered.
If you need to apply the function only to the filtered values:
mask = df.A.str.contains('_')
df.loc[mask, "B"] = df.loc[mask, "A"].str.split('_').str[1]
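Note that the unmasked rows of B are left unset by this; one way to fill them, mirroring the else branch of the original line, could be:
df.loc[~mask, "B"] = df.loc[~mask, "A"].str[0]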
There is an error in your solution, but the problem is not connected with np.where. After splitting by _, if no value exists you get a one-element list, so selecting the second element of the list with [1] raises the error:
print (df.A.apply(lambda x: x.split('_')))
0 [a]
1 [a, 1]
2 [b, ]
3 [b, 2, 3]
Name: A, dtype: object
print (df.A.apply(lambda x: x.split('_')[1]))
IndexError: list index out of range
So it is possible to use a pandas solution here, if performance is not important, because string functions are slow:
df["B"] = np.where(df.A.str.contains('_'),
df.A.str.split('_').str[1],
df.A.str[0])
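For the sample frame, the resulting column should come out roughly as follows (the 'b_' row ends up with an empty string, since nothing follows its underscore):
df["B"].tolist()
# expected: ['a', '1', '', '2']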

How to add values in one array according to repeated values in another array?

Suppose I have an array:
Values = np.array([0.221,0.35,25.9,54.212,0.0022])
Indices = np.array([22,10,11,22,10])
I would like to add elements of 'Values' together that share the same number in 'Indices'.
In other words, my desired outputs(s):
Total = np.array([0.221+54.212, 0.35+0.0022, 25.9])
Index = np.array([22,10,11])
I've been trying to use np.unique to no avail. Can't quite figure this out!
We can use np.unique with its optional arg return_inverse to get IDs based on uniqueness within Indices and then use those with bincount to get binned (ID based) summations and hence solve it like so -
Index,idx = np.unique(Indices, return_inverse=True)
Total = np.bincount(idx, Values)
Outputs for given sample -
In [32]: Index
Out[32]: array([10, 11, 22])
In [33]: Total
Out[33]: array([ 0.3522, 25.9 , 54.433 ])
Alternatively, we can use pandas.factorize to get the unique IDs and then bincount as shown earlier. So, the first step could be replaced by something like this -
import pandas as pd
idx,Index = pd.factorize(Indices)
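Putting those two steps together with the bincount from before, a minimal self-contained sketch of the factorize variant might look like this:
import numpy as np
import pandas as pd

Values = np.array([0.221, 0.35, 25.9, 54.212, 0.0022])
Indices = np.array([22, 10, 11, 22, 10])

idx, Index = pd.factorize(Indices)   # Index: [22, 10, 11], in order of first appearance
Total = np.bincount(idx, Values)     # Total: [54.433, 0.3522, 25.9]
Note that, unlike np.unique, factorize keeps the keys in order of first appearance, which matches the order asked for in the question.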
One possibility is to consider using Pandas:
In [14]: import pandas as pd
In [15]: pd.DataFrame({'Values': Values, 'Indices': Indices}).groupby('Indices').agg(sum)
Out[15]:
Values
Indices
10 0.3522
11 25.9000
22 54.4330
This should be self-explanatory, though it doesn't preserve the order of indices (it's not entirely clear from the question whether you care about that).
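If the order of first appearance does matter, groupby also accepts sort=False, which keeps groups in the order their keys first occur; a possible variant:
(pd.DataFrame({'Values': Values, 'Indices': Indices})
   .groupby('Indices', sort=False)['Values']
   .sum())
# keys come out in first-appearance order: 22, 10, 11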

Set max for a particular column of numpy array

Is there any way to take a column of a numpy array and, whenever the absolute value is greater than some number, set the value to that signed number?
i.e.
for val in col:
    if abs(val) > max:
        val = (signed) max
I know this can be done by looping and such, but I was wondering if there was a cleaner/built-in way to do this.
I see there is something like
arr[arr > 255] = x
which is kind of what I want, but I want to do this by column instead of on the whole array. As a bonus, maybe a way to handle absolute values instead of having to do two separate operations for positive and negative.
The other answer is good, but it doesn't get you all the way there. Frankly, this is somewhat of an RTFM situation. But you'd be forgiven for not grokking the NumPy indexing docs on your first try, because they are dense and the data model will be alien if you are coming from a more traditional programming environment.
You will have to use np.clip on the columns you want to clip, like so:
x[:,2] = np.clip(x[:,2], 0, 255)
This applies np.clip to the column at index 2 of the array (the third column), "slicing" down all rows, then reassigns the result to that column. The : is Python syntax meaning "give me all elements of an indexable sequence".
More generally, you can use the boolean subsetting index that you discovered in the same fashion, by slicing across rows and selecting the desired columns:
x[x[:,2] > 255, 2] = -1
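As for the "bonus" part of the question (treating positive and negative limits in one go), an illustrative sketch with a threshold of 255 could either clip the column symmetrically or combine the boolean mask with np.sign:
# Symmetric clip: anything below -255 becomes -255, anything above 255 becomes 255.
x[:, 2] = np.clip(x[:, 2], -255, 255)

# Equivalent using a mask plus np.sign, closer to the loop in the question:
mask = np.abs(x[:, 2]) > 255
x[mask, 2] = np.sign(x[mask, 2]) * 255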
Try calling clip on your numpy array:
import numpy as np
values = np.array([-3,-2,-1,0,1,2,3])
values.clip(-2,2)
Out[292]:
array([-2, -2, -1, 0, 1, 2, 2])
Maybe it's a little late, but I think this is a good option:
import numpy as np
values = np.array([-3,-2,-1,0,1,2,3])
values = np.clip(values,-2,2)

Difference between max and np.max

I have a question on the difference between just using max(x) and np.max(x), where x is a list or array.
Is the only difference here the time it takes for Python to return the code?
They may differ in edge cases, such as a list containing NaNs.
import numpy as np
a = max([2, 4, np.nan]) # 4
b = np.max([2, 4, np.nan]) # nan
NumPy propagates NaN in such cases, while Python's max result depends on where the NaN sits in the input, because every comparison involving NaN is False.
There are also subtle issues regarding data types:
a = max([10**n for n in range(20)]) # a is an integer
b = np.max([10**n for n in range(20)]) # b is a float
And of course there are running time differences, documented in "numpy.max or max? Which one is faster?"
Generally, one should use max for Python lists and np.max for NumPy arrays to minimize the number of surprises. For instance, my second example is not really about np.max but about the data type conversion: to use np.max the list is first converted to a NumPy array, but elements like 10**19 are too large to be represented by NumPy integer types so they become floats.

Proper way to use "opposite boolean" in Pandas data frame boolean indexing

I wanted to use boolean indexing, checking for rows of my data frame where a particular column does not have NaN values. So, I did the following:
import pandas as pd
my_df.loc[pd.isnull(my_df['col_of_interest']) == False].head()
to see a snippet of that data frame, including only the values that are not NaN (most values are NaN).
It worked, but seems less-than-elegant. I'd want to type:
my_df.loc[!pd.isnull(my_df['col_of_interest'])].head()
However, that generated an error. I also spend a lot of time in R, so maybe I'm confusing things. In Python, I usually use "not" where I can, for instance if x is not None:, but I couldn't really do that here. Is there a more elegant way? I don't like having to put in a senseless comparison.
In general with pandas (and numpy), we use the bitwise NOT ~ instead of ! or not (whose behaviour can't be overridden by types).
While in this case we have notnull, ~ can come in handy in situations where there's no special opposite method.
>>> df = pd.DataFrame({"a": [1, 2, np.nan, 3]})
>>> df.a.isnull()
0 False
1 False
2 True
3 False
Name: a, dtype: bool
>>> ~df.a.isnull()
0 True
1 True
2 False
3 True
Name: a, dtype: bool
>>> df.a.notnull()
0 True
1 True
2 False
3 True
Name: a, dtype: bool
(For completeness I'll note that -, the unary negative operator, will also work on a boolean Series but ~ is the canonical choice, and - has been deprecated for numpy boolean arrays.)
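Applied to the original question, either of these reads more naturally than the == False comparison:
my_df.loc[~my_df['col_of_interest'].isnull()].head()
my_df.loc[my_df['col_of_interest'].notnull()].head()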
Instead of using pandas.isnull(), you should use pandas.notnull() to find the rows where the column has non-null values. Example -
import pandas as pd
my_df.loc[pd.notnull(my_df['col_of_interest'])].head()
pandas.notnull() is the boolean inverse of pandas.isnull(), as given in the documentation:
See also: pandas.notnull, the boolean inverse of pandas.isnull.
