compare rows of a numpy array to all rows


I'm trying to compare each row of a numpy array with the whole numpy array without using iteration.
>>> sample = np.array([[1,2,3],[4,5,6]])
>>> sample
array([[1, 2, 3],
[4, 5, 6]])
First I reshape the 2D-array to a 3D-array:
>>> sample2=sample.reshape(sample.shape[0],1,sample.shape[1])
And then with the following line of code I can compare the rows:
>>> sample2 == sample
array([[[ True, True, True],
[False, False, False]],
[[False, False, False],
[ True, True, True]]])
...which is the result that I'm looking for.
But this does not work with large numpy arrays:
>>> sample3 = np.random.randint(low= 0, high = 2, size = 30000000).reshape(30000,1000)
>>> sample4 = sample3.reshape(sample3.shape[0],1,sample3.shape[1])
>>> sample4 == sample3
<ipython-input-229-e1d55c6bb1ca>:1: DeprecationWarning: elementwise
comparison failed; this will raise an error in the future.
False
How can I solve this?

This may shed some light on your question. Here is my code sample, based on yours:
import numpy as np
n=30000000
ny = 1000
sample3 = np.random.randint(low= 0, high = 2, size = n).reshape(n // ny, ny)
sample4 = sample3.reshape(sample3.shape[0],1,sample3.shape[1])
print(sample3.shape, sample4.shape)
test = sample4 == sample3
print(test)
test = np.equal(sample4, sample3)
print(test)
Its output is:
(30000, 1000) (30000, 1, 1000)
C:\Users\XYZ\python\code_sample.py:7: DeprecationWarning: elementwise comparison failed; this will raise an error in the future.
test = sample4 == sample3
False
Traceback (most recent call last):
File "code_sample.py", line 9, in <module>
test = np.equal(sample4, sample3)
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 838. GiB for an array with shape (30000, 30000, 1000) and data type bool
Also, here are the docs for numpy.equal(), which is presumably what the == operator uses for numpy arrays. They state:
Input arrays. If x1.shape != x2.shape, they must be broadcastable to a common shape (which becomes the shape of the output).
So it looks like equal() is attempting to allocate the full broadcast result, which needs a substantial amount of memory (838 GiB in the example above). Perhaps == decides to fail and give the deprecation warning (rather than something more apt, such as an out-of-memory error) when it realizes there's not enough memory?
Also, if I reduce n from 30000000 to 3000000 and comment out the call to equal(), execution of the == statement takes 10 or 20 seconds before the following result is printed:
(3000, 1000) (3000, 1, 1000)
[[[ True True True ... True True True]
[False True True ... True True True]
[ True True True ... True True True]
...
[False True True ... True False True]
[False False True ... True True False]
[ True False False ... False False False]]
[[False True True ... True True True]
[ True True True ... True True True]
[False True True ... True True True]
...
[ True True True ... True False True]
[ True False True ... True True False]
[False False False ... False False False]]
[[ True True True ... True True True]
[False True True ... True True True]
[ True True True ... True True True]
...
[False True True ... True False True]
[False False True ... True True False]
[ True False False ... False False False]]
...
[[False True True ... True False True]
[ True True True ... True False True]
[False True True ... True False True]
...
[ True True True ... True True True]
[ True False True ... True False False]
[False False False ... False True False]]
[[False False True ... True True False]
[ True False True ... True True False]
[False False True ... True True False]
...
[ True False True ... True False False]
[ True True True ... True True True]
[False True False ... False False True]]
[[ True False False ... False False False]
[False False False ... False False False]
[ True False False ... False False False]
...
[False False False ... False True False]
[False True False ... False False True]
[ True True True ... True True True]]
So it looks like the issue you've encountered is probably related to running out of memory.
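To put a number on that, and as one possible workaround, here is a minimal sketch. It assumes that what you ultimately need is whether whole rows match (an (n, n) result) rather than the full element-wise (n, n, m) array. The helper names broadcast_bool_bytes and rows_equal_in_chunks and the chunk size are made up for illustration.

```python
import numpy as np

def broadcast_bool_bytes(n_rows, n_cols):
    # A bool takes one byte; the broadcast result has shape (n_rows, n_rows, n_cols).
    return n_rows * n_rows * n_cols

def rows_equal_in_chunks(arr, chunk=256):
    # Compare a block of rows against all rows, then reduce over the last
    # axis right away so only a (chunk, n_rows, n_cols) temporary ever exists.
    n = arr.shape[0]
    out = np.empty((n, n), dtype=bool)
    for start in range(0, n, chunk):
        block = arr[start:start + chunk, None, :]  # shape (<=chunk, 1, n_cols)
        out[start:start + chunk] = (block == arr).all(axis=-1)
    return out

# Size of the full comparison from the question, in GiB
gib = broadcast_bool_bytes(30000, 1000) / 2**30

sample = np.array([[1, 2, 3], [4, 5, 6], [1, 2, 3]])
eq = rows_equal_in_chunks(sample, chunk=2)
print(eq)
```

Note that the reduced (n, n) result for 30000 rows still takes about 0.84 GiB, so whether this is workable depends on your machine. If you really need every element-wise comparison, processing one block at a time without storing all of them is probably the only option.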

Related

Index of last occurrence of True in every row

I have a 2D array:
a = ([[False False False False False True True True True True True True
True True True True True True True True True True True True
True False False False]
[False False False False True True True True True True True True
True True True True True True True True True True True True
False False False False]])
I am trying to get the index of last occurrence of 'True' in every row.
So resulting array should be
b = ([24, 23])
To find the first occurrence of True, I know I can use np.argmax().
b = np.argmax(a==True,axis=1)
Is there a function to find from the last? I tried reversing the values of array and then using np.argmax(), but it will give the index of the reversed array.

Process each row in the backwards direction and subtract the
result from the row length - 1:
result = a.shape[1] - np.argmax(a[:, ::-1], axis=1) - 1
Even == True is not needed.
The result is:
array([24, 23], dtype=int64)
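One caveat worth checking: np.argmax returns 0 for a row that contains no True at all, so the formula above reports the last column for such a row. A small sketch of one way to guard against that, assuming -1 is an acceptable "not found" marker (that choice is mine, not part of the question):

```python
import numpy as np

a = np.array([[False, True,  True,  False],
              [False, False, False, False],
              [True,  False, False, True]])

# Index of the last True per row, computed on the reversed rows
last = a.shape[1] - 1 - np.argmax(a[:, ::-1], axis=1)

# Rows without any True get -1 instead of a misleading column index
last_or_missing = np.where(a.any(axis=1), last, -1)
print(last_or_missing)
```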

import numpy as np
a = np.array([[False ,False, False, False, False ,True , True , True , True ,True , True , True,
True , True , True , True, True , True ,True ,True ,True ,True , True, True,
True, False ,False, False],
[False ,False, False, False, False ,True , True , True , True ,True , True , True,
True , True , True , True, True , True ,True ,True ,True ,True , True, True,
False, False ,False, False]])
b = a[...,::-1]
c = [len(i) - np.argmax(i) - 1 for i in b]
print(c)
The list c now holds the index of the last True value in each row.

Pandas rolling: aggregate boolean values

Is there any rolling "any" function in a pandas.DataFrame? Or is there any other way to aggregate boolean values in a rolling function?
Consider:
import pandas as pd
import numpy as np
s = pd.Series([True, True, False, True, False, False, False, True])
# this works but I don't think it is clear enough - I am not
# interested in the sum but a logical or!
s.rolling(2).sum() > 0
# What I would like to have:
s.rolling(2).any()
# AttributeError: 'Rolling' object has no attribute 'any'
s.rolling(2).agg(np.any)
# Same error! AttributeError: 'Rolling' object has no attribute 'any'
So which functions can I use when aggregating booleans? (if numpy.any does not work)
The rolling documentation at https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.DataFrame.rolling.html states that "a Window or Rolling sub-classed for the particular operation" is returned, which doesn't really help.

You aggregate boolean values like this:
# logical or
s.rolling(2).max().astype(bool)
# logical and
s.rolling(2).min().astype(bool)
To deal with the NaN values from incomplete windows, you can use an appropriate fillna before the type conversion, or the min_periods argument of rolling. Depends on the logic you want to implement.
It is a pity this cannot be done in pandas without creating intermediate values as floats.
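For illustration, here is a minimal sketch of the min_periods variant, so that incomplete windows are evaluated on the values they do have rather than producing NaN. The explicit astype(int) is my addition to keep the dtype handling obvious; the names rolling_any and rolling_all are just labels for this example.

```python
import pandas as pd

s = pd.Series([True, True, False, True, False, False, False, True])

# Rolling logical OR: the max over a window of 0/1 values is 1 if any value is True
rolling_any = s.astype(int).rolling(2, min_periods=1).max().astype(bool)

# Rolling logical AND: the min over a window of 0/1 values is 1 only if all are True
rolling_all = s.astype(int).rolling(2, min_periods=1).min().astype(bool)

print(rolling_any.tolist())
print(rolling_all.tolist())
```

With min_periods=1 the first window contains only the first element, so no fillna step is needed before the conversion to bool.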

This method is not implemented. The closest alternative is Rolling.apply:
out = s.rolling(2).apply(lambda x: x.any(), raw=False)
print (out)
0 NaN
1 1.0
2 1.0
3 1.0
4 1.0
5 0.0
6 0.0
7 1.0
dtype: float64
out = s.rolling(2).apply(lambda x: x.any(), raw=False).fillna(0).astype(bool)
print (out)
0 False
1 True
2 True
3 True
4 True
5 False
6 False
7 True
dtype: bool
A better option here is to use strides: generate a 2D numpy array of windows and process it afterwards:
s = pd.Series([True, True, False, True, False, False, False, True])
def rolling_window(a, window):
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
a = rolling_window(s.to_numpy(), 2)
print (a)
[[ True True]
[ True False]
[False True]
[ True False]
[False False]
[False False]
[False True]]
print (np.any(a, axis=1))
[ True True True True False False True]
Here the leading values that pandas would report as NaN are omitted. To include them, pad the start of the array, here with False values (padding with n - 1 values instead of n would give an output of the same length as the Series):
n = 2
x = np.concatenate([[False] * (n), s])
def rolling_window(a, window):
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
a = rolling_window(x, n)
print (a)
[[False False]
[False True]
[ True True]
[ True False]
[False True]
[ True False]
[False False]
[False False]
[False True]]
print (np.any(a, axis=1))
[False True True True True True False False True]
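As a side note, the same windows can be built without computing strides by hand. This is a sketch assuming NumPy 1.20 or later, where np.lib.stride_tricks.sliding_window_view is available; it returns a read-only view, which is enough for a reduction like any.

```python
import numpy as np
import pandas as pd

s = pd.Series([True, True, False, True, False, False, False, True])

# One row per window of length 2, as a read-only view on the data
windows = np.lib.stride_tricks.sliding_window_view(s.to_numpy(), 2)

# Logical OR within each window
any_per_window = windows.any(axis=1)
print(any_per_window)
```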

def function1(ss:pd.Series):
    s[ss.index.max()]=any(ss)
    return 0
s.rolling(2).apply(function1).pipe(lambda ss:s)
0 True
1 True
2 True
3 True
4 True
5 True
6 False
7 True

Return multiple columns based on date range using pandas

I'm basically trying to calculate revenue to date using pandas. I would like to return N columns consisting of each quarter end. Each column would calculate total revenue to date as of that quarter end. I have:
df['Amortization_per_Day'] = (2.5, 3.2, 5.5, 6.5, 9.2)
df['Start_Date'] = ('1/1/2018', '2/27/2018', '3/31/2018', '5/23/2018', '6/30/2018')
Date_Range = pd.date_range('10/31/2017', periods=75, freq='Q-Jan')
and want to do something like:
df['Amortization_per_Day'] * (('Date_Range' - df['Start_Date']).dt.days + 1)
for each date within the Date_Range. I'm not sure how to pass the Date_Range through the function and to return N columns. I've been reading about zip(*df) and shift but not fully grasping it. Thank you so much for your help.

Solution
Here's a complete solution:
from datetime import datetime
import numpy as np
import pandas as pd
df = pd.DataFrame()
df['Amortization_per_Day'] = (2.5, 3.2, 5.5, 6.5, 9.2)
df['Start_Date'] = ('1/1/2018', '2/27/2018', '3/31/2018', '5/23/2018', '6/30/2018')
df['Start_Date'] = pd.to_datetime(df['Start_Date'])
dr = pd.date_range('10/31/2017', periods=75, freq='Q-Jan')
def betweendates(x, y):
    # pad the start dates with the earliest and latest representable dates
    xv = x.values.astype('datetime64[D]')
    xpad = np.zeros(xv.size + 2, dtype=xv.dtype)
    xpad[1:-1] = xv
    xpad[0],xpad[-1] = np.datetime64(datetime.min), np.datetime64(datetime.max)
    yv = y.values.astype('datetime64[D]')
    # one row per date in y, one column per interval between padded start dates
    return (xpad[:-1] <= yv[:,None]) & (xpad[1:] >= yv[:,None])
# get a boolean array that indicates which dates in dr are in between which dates in df['Start_Date']
btwn = betweendates(df['Start_Date'], dr)
# based on the boolean array btwn, select out the salient rows from df and dates from dr
dfsel = df[btwn[:, 1:].T]
drsel = dr[btwn[:, 1:].sum(axis=1, dtype=bool)]
# do the actual calculation the OP wanted
dfsel['Amortization_per_Day'] * ((drsel - dfsel['Start_Date']).dt.days + 1)
Output:
0 77.5
2 170.5
4 294.4
4 1140.8
4 1987.2
4 2806.0
4 3652.4
4 4498.8
4 5345.2
4 6173.2
...
4 52394.0
4 53212.8
4 54059.2
4 54905.6
4 55752.0
4 56570.8
4 57417.2
4 58263.6
4 59110.0
4 59938.0
Length: 74, dtype: float64
Explanation
The boolean btwn array looks like this:
[[ True False False False False False]
[False True False False False False]
[False False False True False False]
[False False False False False True]
[False False False False False True]
[False False False False False True]
[False False False False False True]
[False False False False False True]
[False False False False False True]
[False False False False False True]
[False False False False False True]
[False False False False False True]
[False False False False False True]
...
The ith row of btwn corresponds to the ith datetime in your date range. In each row, exactly one value will be True, and the others will be False. A True value in the 0th column indicates that the datetime is before any of the Start_Date values, a True value in the 1st column indicates that the datetime is in between the 0th and the 1st dates in Start_Date, and so forth. A True value in the last column indicates that the datetime is after all of the Start_Date values.
By slicing btwn like this:
btwn[:, 1:]
it can be used to match up datetimes in your date range with the immediately preceding Start_Date. If you instead change the slices of btwn to be like this:
btwn[:, :-1]
you would end up matching each datetime to the next Start_Date instead.
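If the goal is literally one column per quarter end, a broadcasting approach may be simpler to follow. The sketch below is not the method above; it assumes that a row contributes nothing before its Start_Date (hence the clip at zero), and it uses three hand-picked quarter ends instead of the full date range to keep the output small. The names quarter_ends, active_days and revenue are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Amortization_per_Day": [2.5, 3.2, 5.5, 6.5, 9.2],
    "Start_Date": pd.to_datetime(
        ["2018-01-01", "2018-02-27", "2018-03-31", "2018-05-23", "2018-06-30"]
    ),
})
quarter_ends = pd.to_datetime(["2018-01-31", "2018-04-30", "2018-07-31"])

# Days between every start date (rows) and every quarter end (columns)
start = df["Start_Date"].to_numpy()[:, None]
ends = quarter_ends.to_numpy()[None, :]
days = (ends - start) / np.timedelta64(1, "D")

# Inclusive day count; assumption: nothing accrues before the start date
active_days = np.clip(days + 1, 0, None)

revenue = pd.DataFrame(
    df["Amortization_per_Day"].to_numpy()[:, None] * active_days,
    index=df.index,
    columns=quarter_ends,
)
print(revenue)
```

Each column of revenue is then the revenue to date as of that quarter end, one row per original record.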

Get external coordinates of a polygon from a numpy boolean grid

I am trying to get the external coordinates of a polygon from a numpy boolean grid. For example, from a (16, 16) ndarray such as the following one
[
[False False False False False False True True True True False False False False False False],
[False False False False False True True True True True True False False False False False],
[False False False False False False False False False False True True False False False False],
[False False False False False False False False False False False True False False False False],
[False False False False False False False False False False False False False False False False],
[False False False False False False False False False False False False False False False False],
[False False False False False False False False False False False False False False False False],
[False False False False False False False False False False False False False False False False],
[False False False False False False False False False False False False False False False False],
[False False False False False False False False False False False False False False False False],
[False False False False False False False False False False False False False False False False],
[False False False False False False False False False False False False False False False False],
[False False False False False False False False False False False False False False False False],
[False False False False False False False False False False False False False False False False],
[False False False False False False False False False False False False False False False False],
[False False False False False False False False False False False False False False False False]
]
If we plot that ndarray it will be like the following:
I would like to get the following coordinates, in order, such that we could draw the external ring of such polygon, e.g., [(5, 1), (6, 0), (7, 0), (8, 0), (9, 0), (10, 1), (11, 2), (11, 3), (10, 2), (9, 1), (8, 1), (7, 1), (6, 1)]. What I have so far is the following:
# Consider that the boolean ndarray above is called 'prediction'
import numpy as np
from shapely.geometry import Polygon, Point
import matplotlib.pyplot as plt
# Get the coordinates that match the boolean polygon
(y, x) = np.where(prediction == True)
# Iterate on each of the coordinates, however my problem is that it is not aware of the contour order as it should be :/
coordinates = [Point(x_coordinate, y_coordinate) for x_coordinate, y_coordinate in zip(x, y)]
# Build the polygon out of the points
polygon = Polygon([[coordinate.x, coordinate.y] for coordinate in coordinates])
exterior_x, exterior_y = polygon.exterior.xy
# Plotting
fig = plt.figure(1, figsize=(5, 5))
ax = fig.add_subplot(1, 2, 1)
ax.plot(exterior_x, exterior_y, color='#6699cc')
ax.invert_yaxis()
plt.subplot(1, 2, 2)
plt.imshow(prediction)
plt.show()
The problem is that I am building the polygon not considering the order so that the result of polygon.exterior.xy will create the external ring. My approach will create the wrong contour of the polygon such as:
However, I am unable to come up with a general approach for this problem. I welcome any suggestion on how to tackle this problem. Thanks in advance.

Perhaps you could move the question to the GIS Stack Exchange site, where you will probably get more help.
Anyway, a quick search turns up this answer, which suggests using the rasterio library; as I understand it, that is what you need.
Adapted to your case, it could look something like this:
import numpy as np
import rasterio.features
# Convert your array to 0-1 integers
myarray = [[1 if t else 0 for t in row] for row in myarray]
# Build a numpy array
myarray = np.array(myarray)
# Convert the type (I don't know why this was needed on my computer, but an exception was raised if not converted)
myarray = myarray.astype(np.int32)
# Let the library do the magic. You should take a look at the rasterio.features.shapes output
mypols = [p[0]['coordinates']
for p in rasterio.features.shapes(myarray)]
mypols is now a list of coordinates that you can easily convert to shapely Polygons.
Be sure to test unusual cases properly. I tried to build a multipolygon, and the library returned each connected component as a separate polygon. Fortunately, it also returns the associated value for each polygon, so you can post-process as you like.
Polygons with interior rings seem to be handled OK, though.
I don't know what behavior you would expect in those cases.

I'd use ConvexHull from SciPy: it finds the smallest convex envelope that contains all your points, which matches the polygon contour when the shape is convex.
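A minimal sketch of that idea on the grid from the question, assuming SciPy is installed. The grid is rebuilt by hand here, and outline_set is only a convenience for inspecting the result. Keep in mind that a convex hull drops concave vertices, so points such as (10, 2) from the desired ring will not appear in it.

```python
import numpy as np
from scipy.spatial import ConvexHull

# Rebuild the boolean grid from the question
prediction = np.zeros((16, 16), dtype=bool)
prediction[0, 6:10] = True
prediction[1, 5:11] = True
prediction[2, 10:12] = True
prediction[3, 11] = True

# Coordinates of the True cells as (x, y) pairs
rows, cols = np.nonzero(prediction)
points = np.column_stack([cols, rows])

hull = ConvexHull(points)
# For 2-D input, hull.vertices lists the corner points in counterclockwise order
outline = points[hull.vertices]
outline_set = {tuple(int(v) for v in p) for p in outline}
print(outline)
```

So this gives an ordered outline, but only of the convex envelope; for the exact ring of a concave shape, a contour-tracing approach is still needed.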

Component wise comparison of float values in Numpy is returning wrong result

I am getting wrong results when I do an element wise comparison on a numpy array of floats.
For example:
import numpy as np
a = np.arange(4, 5 + 0.025, 0.025)
print(a)
mask = a==5.0
print(mask)
na = a[mask]
print(na)
When I run the above code, a == 5.0 doesn't give me a True value for the
index where the value is in fact 5.0
I also tried setting the dtype of array to numpy.double thinking it could
be a floating point precision issue but it still returns me wrong result.
I am pretty sure I am missing something here....can anyone point me to right direction or tell me what's wrong with the code above?
Thanks!

Floating-point values are imprecise here, so use np.isclose to compare the array against a scalar float value:
In [50]:
mask = np.isclose(a,5.0)
print(mask)
na = a[mask]
na
[False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False True False]
Out[50]:
array([ 5.])
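To see the effect in one place, here is a small sketch. The first part repeats the np.isclose comparison; the second part is an alternative I am suggesting rather than something from the question: np.linspace takes the number of points instead of a step, and with the default endpoint=True the last value is the stop value itself.

```python
import numpy as np

# Accumulated rounding in arange means the value near 5.0 need not equal 5.0 exactly
a = np.arange(4, 5 + 0.025, 0.025)
mask = np.isclose(a, 5.0)
near_five = a[mask]
print(near_five)

# 41 evenly spaced points from 4 to 5 inclusive, i.e. a step of 0.025
b = np.linspace(4, 5, 41)
print(b[-1] == 5.0)
```

np.isclose uses a relative and an absolute tolerance (rtol and atol), so the defaults are worth checking if your values are very small or very large.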
