Get external coordinates of a polygon from a numpy boolean grid - python

I am trying to get the external coordinates of a polygon from a numpy boolean grid. For example, from a (16, 16) ndarray such as the following one:
[
[False False False False False False True True True True False False False False False False],
[False False False False False True True True True True True False False False False False],
[False False False False False False False False False False True True False False False False],
[False False False False False False False False False False False True False False False False],
[False False False False False False False False False False False False False False False False],
[False False False False False False False False False False False False False False False False],
[False False False False False False False False False False False False False False False False],
[False False False False False False False False False False False False False False False False],
[False False False False False False False False False False False False False False False False],
[False False False False False False False False False False False False False False False False],
[False False False False False False False False False False False False False False False False],
[False False False False False False False False False False False False False False False False],
[False False False False False False False False False False False False False False False False],
[False False False False False False False False False False False False False False False False],
[False False False False False False False False False False False False False False False False],
[False False False False False False False False False False False False False False False False]
]
Plotting that ndarray shows a single connected blob of True cells in the top rows. I would like to get the following coordinates, in order, such that we could draw the external ring of that polygon, e.g., [(5, 1), (6, 0), (7, 0), (8, 0), (9, 0), (10, 1), (11, 2), (11, 3), (10, 2), (9, 1), (8, 1), (7, 1), (6, 1)]. What I have so far is the following:
# Consider that the boolean ndarray above is called 'prediction'
import numpy as np
from shapely.geometry import Polygon, Point
import matplotlib.pyplot as plt
# Get the coordinates that match the boolean polygon
(y, x) = np.where(prediction == True)
# Iterate over the coordinates; the problem is that np.where yields them in
# row-major order, which is not the contour order :/
coordinates = [Point(x_coordinate, y_coordinate) for x_coordinate, y_coordinate in zip(x, y)]
# Build the polygon out of the points
polygon = Polygon([[coordinate.x, coordinate.y] for coordinate in coordinates])
exterior_x, exterior_y = polygon.exterior.xy
# Plotting
fig = plt.figure(1, figsize=(5, 5))
ax = fig.add_subplot(1, 2, 1)
ax.plot(exterior_x, exterior_y, color='#6699cc')
ax.invert_yaxis()
plt.subplot(1, 2, 2)
plt.imshow(prediction)
plt.show()
The problem is that I am building the polygon without considering the order needed for polygon.exterior.xy to form the external ring, so my approach creates a wrong, self-intersecting contour of the polygon. However, I am unable to come up with a general approach to this problem. I welcome any suggestions on how to tackle it. Thanks in advance.

Perhaps you can move the question to the GIS Stack Exchange site; you will probably get more help on this there.
Anyway, a quick search shows this answer, where it is suggested to use the rasterio library, which I understand is what you need.
Adapted to your case, it can be something like:
import numpy as np
import rasterio.features
# Convert your boolean array to 0-1 integers and build a numpy array
myarray = np.array([[1 if t else 0 for t in row] for row in myarray])
# Convert the dtype: rasterio.features only accepts a few dtypes
# (int64, the default here, is not among them, hence the exception otherwise)
myarray = myarray.astype(np.int32)
# Let the library do the magic. You should take a look at the
# rasterio.features.shapes output
mypols = [p[0]['coordinates']
          for p in rasterio.features.shapes(myarray)]
mypols is now a list of coordinate lists that you can easily convert to shapely Polygons.
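For instance, a minimal sketch of that conversion (assuming mypols holds the GeoJSON-style coordinate lists from above, where the first ring of each entry is the exterior and any further rings are holes):
from shapely.geometry import Polygon
# Each entry in mypols is a list of rings: the first is the exterior,
# the rest (if any) are interior rings (holes)
polygons = [Polygon(shell=rings[0], holes=rings[1:]) for rings in mypols]
print(polygons[0].exterior.xy)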
Be sure to test stranger cases properly. I tried to build a multipolygon, and the library returned each connected component as a separate polygon. Fortunately, it returns the associated raster value for each polygon, so you can post-process as you like.
Polygons with interior rings seem to be handled OK, though.
I don't know what behavior you would expect in those cases.

I'd use ConvexHull; it tries to find the smallest envelope that contains all your points, which would be the polygon contour: ConvexHull with Scipy
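A minimal sketch of that idea, reusing the prediction array from the question (note that a convex hull reproduces the true contour only when the region is convex, so it will cut across concave parts):
import numpy as np
from scipy.spatial import ConvexHull
# Coordinates of the True cells as (x, y) pairs
y, x = np.where(prediction)
points = np.column_stack([x, y])
# ConvexHull.vertices lists the hull points in counterclockwise order,
# which is already a valid drawing order for the contour
hull = ConvexHull(points)
contour = points[hull.vertices]
print(contour)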

Related

compare rows of a numpy array to all rows

I'm trying to compare each row of a numpy array with the whole numpy array without using iteration.
>>> sample = np.array([[1,2,3],[4,5,6]])
>>> sample
array([[1, 2, 3],
[4, 5, 6]])
First I reshape the 2D-array to a 3D-array:
>>> sample2=sample.reshape(sample.shape[0],1,sample.shape[1])
And then with the following line of code I can compare the rows:
>>> sample2 == sample
array([[[ True, True, True],
[False, False, False]],
[[False, False, False],
[ True, True, True]]])
...which is the result that I'm looking for.
But this does not work with large numpy arrays:
>>> sample3 = np.random.randint(low= 0, high = 2, size = 30000000).reshape(30000,1000)
>>> sample4 = sample3.reshape(sample3.shape[0],1,sample3.shape[1])
>>> sample4 == sample3
<ipython-input-229-e1d55c6bb1ca>:1: DeprecationWarning: elementwise
comparison failed; this will raise an error in the future.
False
How can I solve this?
This may shed some light on your question. Here is my code sample, based on yours:
import numpy as np
n = 30000000
ny = 1000
sample3 = np.random.randint(low=0, high=2, size=n).reshape(n // ny, ny)
sample4 = sample3.reshape(sample3.shape[0], 1, sample3.shape[1])
print(sample3.shape, sample4.shape)
test = sample4 == sample3
print(test)
test = np.equal(sample4, sample3)
print(test)
Its output is:
(30000, 1000) (30000, 1, 1000)
C:\Users\XYZ\python\code_sample.py:7: DeprecationWarning: elementwise comparison failed; this will raise an error in the future.
test = sample4 == sample3
False
Traceback (most recent call last):
File "code_sample.py", line 9, in <module>
test = np.equal(sample4, sample3)
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 838. GiB for an array with shape (30000, 30000, 1000) and data type bool
Also, here are the docs for numpy.equal(), which is presumably used by the == operator for numpy arrays. They state:
Input arrays. If x1.shape != x2.shape, they must be broadcastable to a common shape (which becomes the shape of the output).
So it looks like equal() may be attempting to allocate a substantial amount of memory (838 GiB in the example above). Perhaps == decides to fail and emit the deprecation warning (rather than something more apt, such as an out-of-memory error) when it realizes there's not enough memory?
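The arithmetic checks out: a bool occupies one byte, so the broadcast result of shape (30000, 30000, 1000) needs:
# 30000 * 30000 * 1000 bools at 1 byte each, expressed in GiB
print(30000 * 30000 * 1000 / 2**30)  # ~838.19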
Also, if I reduce n from 30000000 to 3000000 and comment out the call to equal(), execution of the == statement takes 10 or 20 seconds before the following result is printed:
(3000, 1000) (3000, 1, 1000)
[[[ True True True ... True True True]
[False True True ... True True True]
[ True True True ... True True True]
...
[False True True ... True False True]
[False False True ... True True False]
[ True False False ... False False False]]
[[False True True ... True True True]
[ True True True ... True True True]
[False True True ... True True True]
...
[ True True True ... True False True]
[ True False True ... True True False]
[False False False ... False False False]]
[[ True True True ... True True True]
[False True True ... True True True]
[ True True True ... True True True]
...
[False True True ... True False True]
[False False True ... True True False]
[ True False False ... False False False]]
...
[[False True True ... True False True]
[ True True True ... True False True]
[False True True ... True False True]
...
[ True True True ... True True True]
[ True False True ... True False False]
[False False False ... False True False]]
[[False False True ... True True False]
[ True False True ... True True False]
[False False True ... True True False]
...
[ True False True ... True False False]
[ True True True ... True True True]
[False True False ... False False True]]
[[ True False False ... False False False]
[False False False ... False False False]
[ True False False ... False False False]
...
[False False False ... False True False]
[False True False ... False False True]
[ True True True ... True True True]]
So it looks like the issue you've encountered is probably related to running out of memory.

Index of last occurrence of True in every row

I have a 2D array:
a = ([[False False False False False True True True True True True True
True True True True True True True True True True True True
True False False False]
[False False False False True True True True True True True True
True True True True True True True True True True True True
False False False False]])
I am trying to get the index of last occurrence of 'True' in every row.
So resulting array should be
b = ([24, 23])
To find the occurrence of the first True, I know I can use np.argmax():
b = np.argmax(a == True, axis=1)
Is there a function to find it from the end? I tried reversing the values of the array and then using np.argmax(), but that gives the index in the reversed array.
Process each row in the backwards direction and subtract the result from the row length minus 1:
result = a.shape[1] - np.argmax(a[:, ::-1], axis=1) - 1
Note that the == True comparison is not even needed.
The result is:
array([24, 23], dtype=int64)
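One caveat: np.argmax returns 0 for a row that contains no True at all, so an all-False row would wrongly report index a.shape[1] - 1. If that case can occur in your data, a small guard helps, e.g.:
import numpy as np
last_true = a.shape[1] - np.argmax(a[:, ::-1], axis=1) - 1
# Rows without any True get -1 instead of a bogus index
result = np.where(a.any(axis=1), last_true, -1)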
import numpy as np
a = np.array([[False, False, False, False, False, True, True, True, True, True,
               True, True, True, True, True, True, True, True, True, True,
               True, True, True, True, True, False, False, False],
              [False, False, False, False, False, True, True, True, True, True,
               True, True, True, True, True, True, True, True, True, True,
               True, True, True, True, False, False, False, False]])
# Reverse each row, find the first True from the end, and convert back
b = a[..., ::-1]
c = [len(i) - np.argmax(i) - 1 for i in b]
print(c)
The list c now holds the index of the last True value in each row.

Python - isnull().sum() vs isnull().count()

So I'm currently finishing a tutorial with the titanic dataset (https://www.kaggle.com/c/titanic/data).
Now I'm trying a couple of new things that might be related.
The train_df.info() output shows that there are 891 entries and several columns with NaN values.
When I went to produce a little summary of the missing values with train_df.isnull().sum(), I got confused by .sum() and .count():
.sum() increments by one for each instance of a null value, so the output is the number of missing entries for each column in the data frame (which is what I want).
However, if we use .count(), we get 891 for each column, no matter whether we use .isnull().count() or .notnull().count().
So my questions are:
What does .count() mean in this context?
I thought that it would count every instance of the wanted condition (in this case every null or not-null entry; basically what .sum() did).
Also, is my "definition" of how .sum() is being used correct?
Just print out train_df.isnull() and you will see it.
# data analysis and wrangling
import pandas as pd
import numpy as np
# visualization
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
train_df = pd.read_csv('train.csv')
print(train_df.isnull())
result:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket \
0 False False False False False False False False False
1 False False False False False False False False False
2 False False False False False False False False False
3 False False False False False False False False False
4 False False False False False False False False False
.. ... ... ... ... ... ... ... ... ...
886 False False False False False False False False False
887 False False False False False False False False False
888 False False False False False True False False False
889 False False False False False False False False False
890 False False False False False False False False False
It has 891 rows full of True and False values.
When you use sum(), it returns the sum of every column, adding up True (= 1) and False (= 0) values, like this:
print(False+False+True+True)
2
When you use count(), it just returns the number of rows.
Of course you will get 891 for each column, no matter whether you use .isnull().count() or .notnull().count().
data.isnull().count() returns the total number of rows irrespective of missing values. You need to use data.isnull().sum().
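A small self-contained example (with a made-up Age column just for illustration) makes the difference concrete:
import numpy as np
import pandas as pd
toy = pd.DataFrame({'Age': [22.0, np.nan, 26.0, np.nan]})
print(toy['Age'].isnull().sum())    # 2 -> number of missing values
print(toy['Age'].isnull().count())  # 4 -> isnull() has an entry in every row
print(toy['Age'].count())           # 2 -> count() itself skips NaN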

Return multiple columns based on date range using pandas

I'm basically trying to calculate revenue to date using pandas. I would like to return N columns consisting of each quarter end. Each column would calculate total revenue to date as of that quarter end. I have:
df['Amortization_per_Day'] = (2.5, 3.2, 5.5, 6.5, 9.2)
df['Start_Date'] = ('1/1/2018', '2/27/2018', '3/31/2018', '5/23/2018', '6/30/2018')
Date_Range = pd.date_range('10/31/2017', periods=75, freq='Q-Jan')
and want to do something like:
df['Amortization_per_Day'] * (('Date_Range' - df['Start_Date']).dt.days + 1)
for each date within the Date_Range. I'm not sure how to pass the Date_Range through the function and return N columns. I've been reading about zip(*df) and shift but am not fully grasping them. Thank you so much for your help.
Solution
Here's a complete solution:
from datetime import datetime
import numpy as np
import pandas as pd
df = pd.DataFrame()
df['Amortization_per_Day'] = (2.5, 3.2, 5.5, 6.5, 9.2)
df['Start_Date'] = ('1/1/18', '2/27/18', '3/31/18', '5/23/2018', '6/30/2018')
df['Start_Date'] = pd.to_datetime(df['Start_Date'])
dr = pd.date_range('10/31/2017', periods=75, freq='Q-Jan')
def betweendates(x, y):
    # Pad the sorted start dates with min/max sentinels so that every
    # date in y falls into exactly one interval
    xv = x.values.astype('datetime64[D]')
    xpad = np.zeros(xv.size + 2, dtype=xv.dtype)
    xpad[1:-1] = xv
    xpad[0], xpad[-1] = np.datetime64(datetime.min), np.datetime64(datetime.max)
    yv = y.values.astype('datetime64[D]')
    return (xpad[:-1] <= yv[:, None]) & (xpad[1:] >= yv[:, None])
# get a boolean array that indicates which dates in dr are in between which dates in df['Start_Date']
btwn = betweendates(df['Start_Date'], dr)
# based on the boolean array btwn, select out the salient rows from df and dates from dr
dfsel = df[btwn[:, 1:].T]
drsel = dr[btwn[:, 1:].sum(axis=1, dtype=bool)]
# do the actual calculation the OP wanted
dfsel['Amortization_per_Day'] * ((drsel - dfsel['Start_Date']).dt.days + 1)
Output:
0 77.5
2 170.5
4 294.4
4 1140.8
4 1987.2
4 2806.0
4 3652.4
4 4498.8
4 5345.2
4 6173.2
...
4 52394.0
4 53212.8
4 54059.2
4 54905.6
4 55752.0
4 56570.8
4 57417.2
4 58263.6
4 59110.0
4 59938.0
Length: 74, dtype: float64
Explanation
The boolean btwn array looks like this:
[[ True False False False False False]
[False True False False False False]
[False False False True False False]
[False False False False False True]
[False False False False False True]
[False False False False False True]
[False False False False False True]
[False False False False False True]
[False False False False False True]
[False False False False False True]
[False False False False False True]
[False False False False False True]
[False False False False False True]
...
The ith row of btwn corresponds to the ith datetime in your date range. In each row, exactly one value will be True and the others will be False. A True value in the 0th column indicates that the datetime is before all of the Start_Dates, a True value in the 1st column indicates that the datetime is between the 0th and the 1st dates in Start_Date, and so forth. A True value in the last column indicates that the datetime is after all of the Start_Dates.
By slicing btwn like this:
btwn[:, 1:]
it can be used to match up datetimes in your date range with the immediately preceding Start_Time. If you instead change the slices of btwn to be like this:
btwn[:, :-1]
you would end up matching each datetime to the next Start_Time instead.
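As an aside, the same bucketing can also be expressed with np.searchsorted, which may be easier to follow than the padded comparison; a sketch, assuming df['Start_Date'] is sorted as it is above:
import numpy as np
# For each date in dr, count how many Start_Dates are <= it;
# 0 corresponds to column 0 of btwn ("before all Start_Dates")
bucket = np.searchsorted(df['Start_Date'].values, dr.values, side='right')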

Component-wise comparison of float values in Numpy is returning wrong results

I am getting wrong results when I do an element-wise comparison on a numpy array of floats.
For example:
import numpy as np
a = np.arange(4, 5 + 0.025, 0.025)
print(a)
mask = a == 5.0
print(mask)
na = a[mask]
print(na)
When I run the above code, a == 5.0 doesn't give me a True value at the index where the value is in fact 5.0.
I also tried setting the dtype of the array to numpy.double, thinking it could be a floating-point precision issue, but it still returns the wrong result.
I am pretty sure I am missing something here... can anyone point me in the right direction or tell me what's wrong with the code above?
Thanks!
There is an imprecision here when using float types; use np.isclose to compare an array against a scalar float value:
In [50]:
mask = np.isclose(a,5.0)
print(mask)
na = a[mask]
na
[False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False True False]
Out[50]:
array([ 5.])
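To see why, inspect the element that should be 5.0: each 0.025 step carries a tiny binary rounding error, and per the question the stored value is not exactly 5.0:
import numpy as np
a = np.arange(4, 5 + 0.025, 0.025)
idx = np.abs(a - 5.0).argmin()
print(repr(a[idx]))              # the stored value, off by a tiny amount
print(a[idx] == 5.0)             # False in the question's run, hence the empty selection
print(np.isclose(a, 5.0).any())  # True: isclose tolerates the error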
