I'm trying to see if numpy.histogram2d will cross tabulate data in 2 arrays for me. I've never used this function before and I'm getting an error I don't know how to fix.
import numpy as np
import random
zones = np.zeros((20,30), int)
values = np.zeros((20,30), int)
for i in range(20):
    for j in range(30):
        values[i,j] = random.randint(0,10)
zones[:8,:15] = 100
zones[8:,:15] = 101
zones[:8,15:] = 102
zones[8:,15:] = 103
np.histogram2d(zones,values)
This code results in the following error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-18-53447df32000> in <module>()
----> 1 np.histogram2d(zones,values)
C:\Python27\ArcGISx6410.2\lib\site-packages\numpy\lib\twodim_base.pyc in histogram2d(x, y, bins, range, normed, weights)
613 xedges = yedges = asarray(bins, float)
614 bins = [xedges, yedges]
--> 615 hist, edges = histogramdd([x,y], bins, range, normed, weights)
616 return hist, edges[0], edges[1]
617
C:\Python27\ArcGISx6410.2\lib\site-packages\numpy\lib\function_base.pyc in histogramdd(sample, bins, range, normed, weights)
279 # Sample is a sequence of 1D arrays.
280 sample = atleast_2d(sample).T
--> 281 N, D = sample.shape
282
283 nbin = empty(D, int)
ValueError: too many values to unpack
Here is what I am trying to accomplish:
I have 2 arrays. One array comes from a geographic dataset (raster) representing Landcover classes (e.g. 1=Tree, 2=Grass, 3=Building, etc.). The other array comes from a geographic dataset (raster) representing some sort of political boundary (e.g. parcels, census blocks, towns, etc). I am trying to get a table that lists each unique political boundary area (array values represent a unique id) as rows and the total number of pixels within each boundary for each landcover class as columns.
I'm assuming values is the landcover and zones is the political boundaries. You might want to use np.bincount, which is like a special histogram where each bin has spacing and width of exactly one.
import numpy as np
zones = np.zeros((20,30), int)
zones[:8,:15] = 100
zones[8:,:15] = 101
zones[:8,15:] = 102
zones[8:,15:] = 103
values = np.random.randint(0,10,(20,30)) # no need for that loop
tab = np.array([np.bincount(values[zones==zone]) for zone in np.unique(zones)])
You can do this more simply with histogram, though, if you are careful with the bin edges:
np.histogram2d(zones.flatten(), values.flatten(), bins=[np.unique(zones).size, values.max()-values.min()+1])
The way this works is as follows. The easiest example is to look at all values regardless of zone:
np.bincount(values.ravel())
This gives you one row with the counts for each value (0 to 9). The next step is to look at the zones.
For one zone, you'd have just one row, and it would be:
zone = 101 # the desired zone
mask = zone==zones # a mask that is True wherever your zones map matches the desired zone
np.bincount(values[mask]) # count the values where the mask is True
Now, we just want to do this for each zone in the map. You can get a list of the unique values in your zones map with
zs = np.unique(zones)
and loop through it with a list comprehension, where each item is one of the rows as above:
tab = np.array([np.bincount(values[zones==zone]) for zone in np.unique(zones)])
Then, your table looks like this:
print tab
# elements with cover =
# 0 1 2 3 4 5 6 7 8 9 # in zone:
[[16 11 10 12 13 15 11 7 13 12] # 100
[13 23 15 16 24 16 24 21 15 13] # 101
[10 12 23 13 12 11 11 5 11 12] # 102
[19 25 20 12 16 19 13 18 22 16]] # 103
Finally, you can plot this in matplotlib as so:
import matplotlib.pyplot as plt
plt.hist2d(zones.flatten(), values.flatten(), bins=[np.unique(zones).size, values.max()-values.min()+1])
histogram2d expects 1D arrays as input, and your zones and values are 2D. You could linearize them with ravel:
np.histogram2d(zones.ravel(), values.ravel())
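If you also want one bin per zone id and one bin per landcover value, so the result is exactly the cross-tab (a sketch under that assumption, not part of the original answer), you can pass explicit bin edges:
zone_edges = np.append(np.unique(zones), zones.max() + 1)   # one bin per zone id
value_edges = np.arange(values.min(), values.max() + 2)     # one bin per landcover value
hist, _, _ = np.histogram2d(zones.ravel(), values.ravel(), bins=[zone_edges, value_edges])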
If efficiency isn't a concern, I think this works for what you want to do
from collections import Counter
c = Counter(zip(zones.flat[:], landcover_classes.flat[:]))
c will map each (zone, landcover class) tuple to its count. You can populate an array if you like with
for (i, j), count in c.items():
    my_table[i, j] = count
That only works, of course, if i and j are sequential integers starting at zero (i.e., from 0 to Ni and 0 to Nj).
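If your ids are not sequential, one possible workaround (just a sketch, reusing the array names from above) is to remap them to 0..N-1 with np.unique before counting:
import numpy as np
# remap arbitrary ids to 0..N-1, then count as before
zone_idx = np.unique(zones, return_inverse=True)[1].reshape(zones.shape)
class_idx = np.unique(landcover_classes, return_inverse=True)[1].reshape(landcover_classes.shape)
c = Counter(zip(zone_idx.flat, class_idx.flat))
my_table = np.zeros((zone_idx.max() + 1, class_idx.max() + 1), int)
for (i, j), count in c.items():
    my_table[i, j] = count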
Related
I have an array of repeated values that are used to match datapoints to some ID.
How can I replace the IDs with counting up index values in a vectorized manner?
Consider the following minimal example:
import numpy as np
n_samples = 10
ids = np.random.randint(0,500, n_samples)
lengths = np.random.randint(1,5, n_samples)
x = np.repeat(ids, lengths)
print(x)
Output:
[129 129 129 129 173 173 173 207 207 5 430 147 143 256 256 256 256 230 230 68]
Desired solution:
indices = np.arange(n_samples)
y = np.repeat(indices, lengths)
print(y)
Output:
[0 0 0 0 1 1 1 2 2 3 4 5 6 7 7 7 7 8 8 9]
However, in the real code, I do not have access to variables like ids and lengths, but only x.
It does not matter what the values in x are, I just want an array with counting up integers which are repeated the same amount as in x.
I can come up with solutions using for-loops or np.unique, but both are too slow for my use case.
Has anyone an idea for a fast algorithm that takes an array like x and returns an array like y?
You can do:
y = np.r_[False, x[1:] != x[:-1]].cumsum()
Or with one less temporary array:
y = np.empty(len(x), int)
y[0] = 0
np.cumsum(x[1:] != x[:-1], out=y[1:])
print(y)
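Each position where the value changes marks the start of a new block, and the cumulative sum turns those change points into a running group index. As a quick sanity check against the example above (a sketch reusing the x from the question):
import numpy as np
x = np.array([129,129,129,129,173,173,173,207,207,5,430,147,143,256,256,256,256,230,230,68])
y = np.r_[False, x[1:] != x[:-1]].cumsum()
print(y)  # [0 0 0 0 1 1 1 2 2 3 4 5 6 7 7 7 7 8 8 9]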
I have a dataset with some rows containing singular answers and others having multiple answers. Like so:
year length Animation
0 1971 121 1,2,3
1 1939 71 1,3
2 1941 7 0,2
3 1996 70 1,2,0
4 1975 71 3,2,0
With the singular answers I managed to create a heatmap using df.corr(), but I can't figure out what is the best approach for multiple answers rows.
I could split them and add additional columns for each answer like:
year length Animation
0 1971 121 1
1 1971 121 2
2 1971 121 3
3 1939 71 1
4 1939 71 3 ...
and then do the exact same df.corr(), or add additional Animation_01, Animation_02 ... columns, but there must be a smarter way to work around this issue?
EDIT: Actual data snippet
You should compute a frequency table between two categorical variables using pd.crosstab() and perform subsequent analyses based on this table. A correlation computed with df.corr() is NOT mathematically meaningful when one of x and y is categorical, whether or not it is encoded as a number.
N.B.1 If x is categorical but y is numerical, there are two options to describe the linkage between them:
Group y into quantiles (bins) and treat it as categorical
Perform a linear regression of y against one-hot encoded dummy variables of x
Option 2 is more precise in general, but the statistical details are beyond the scope of this question. This post will focus on the case of two categorical variables.
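For completeness, a rough sketch of what Option 2 could look like (purely illustrative; it assumes df has already been exploded as in the preprocessing section below, and it uses a plain least-squares fit rather than a dedicated stats package):
import numpy as np
import pandas as pd
# one-hot encode the categorical variable and fit length ~ dummies
X = pd.get_dummies(df["Animation"], prefix="anim").astype(float)
X.insert(0, "const", 1.0)  # intercept column
y = df["length"].astype(float)
coef, *_ = np.linalg.lstsq(X.values, y.values, rcond=None)
print(dict(zip(X.columns, coef)))  # per-category effects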
N.B.2 For sparse matrix output please see this post.
Sample Solution
Data & Preprocessing
import pandas as pd
import io
import matplotlib.pyplot as plt
from seaborn import heatmap
df = pd.read_csv(io.StringIO("""
year length Animation
0 1971 121 1,2,3
1 1939 71 1,3
2 1941 7 0,2
3 1996 70 1,2,0
4 1975 71 3,2,0
"""), sep=r"\s{2,}", engine="python")
# convert string to list
df["Animation"] = df["Animation"].str.split(',')
# expand list column into new rows
df = df.explode("Animation")
# (optional)
df["Animation"] = df["Animation"].astype(int)
Frequency Table
Note: grouping of length is ignored for simplicity
ct = pd.crosstab(df["Animation"], df["length"])
print(ct)
# Out[65]:
# length 7 70 71 121
# Animation
# 0 1 1 1 0
# 1 0 1 1 1
# 2 1 1 1 1
# 3 0 0 2 1
Visualization
ax = heatmap(ct, cmap="viridis",
             yticklabels=df["Animation"].drop_duplicates().sort_values(),
             xticklabels=df["length"].drop_duplicates().sort_values(),
             )
ax.set_title("Title", fontsize=20)
plt.show()
Example Analysis
Based on the frequency table, you can ask questions about the distribution of y given a certain (subset of) x value(s), or vice versa. This should better describe the linkage between two categorical variables, as the categorical variables have no order.
For example,
Q: What length does Animation=3 produce?
A: 66.7% chance to give 71
33.3% chance to give 121
otherwise unobserved
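That conditional distribution can be read straight off the frequency table by row-normalising it, for example (a small sketch using the ct built above, and assuming the optional astype(int) step was applied):
cond = ct.div(ct.sum(axis=1), axis=0)  # P(length | Animation)
print(cond.loc[3])  # row for Animation == 3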
You want to break Animation (or Preferred_positions in your data snippet) up into a series of one-hot columns, one for every unique string in the original column. Each column will hold either zero or one, with a one in rows where that string appears in the original column.
First, you need to get all the unique substrings in Preferred_positions (see this answer for how to deal with a column of lists).
positions = df.Preferred_positions.str.split(',').explode().unique()
Then you can create the positions columns in a loop based on whether the given position is in Preferred_positions for each row.
for position in positions:
    df[position] = df.Preferred_positions.apply(
        lambda x: 1 if position in x else 0
    )
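As a possible shortcut (an alternative suggestion, not part of the approach above), pandas can build the same one-hot columns in a single call with str.get_dummies:
# split on the separator and one-hot encode in one go
dummies = df.Preferred_positions.str.get_dummies(sep=',')
df = df.join(dummies)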
I am analysing datasets and I need to compare them. Each of the two datasets has an index and coordinates (X, Y). The coordinates are not exactly equal, so I need to use something like the numpy.isclose (e.g. atol=5) function.
My aim in the comparison is to find similar y coordinates (e.g. y[5] = 101 (Dataset 1), y2[7] = 103 (Dataset 2)), and then to compare the x coordinates at those same indices (e.g. x[5] = 405 (Dataset 1), x2[7] = 401 (Dataset 2)).
My problem is that I can't combine these two isclose functions.
I have tried to compare first the y and afterwards the x coordinates. As separate comparisons, the function also matches unrelated data (e.g. y[5] = 101, y2[7] = 103; x[5] = 405, x[3] = 402). It needs to compare the same indices (5/5 and 7/7).
This is working but gives wrong results:
yres = {i for i in yvar if numpy.isclose(yvar2, i, atol= 5).any()}
xres = {i for i in xvar if numpy.isclose(xvar2, i, atol= 5).any()}
Theoretically, I am searching for something like this:
yres = {i for i in yvar if numpy.isclose(yvar2, i, atol= 5).any() & i for i in xvar if numpy.isclose(xvar2, i, atol= 5).any()}
I expect to find points with similar coordinates
(e.g. y[5]=101, y2[7] = 103 ; x[5] = 405 , x2[7] = 401).
At the moment I get matches with any similar data
(e.g. y[5]=101, y2[7] = 103 ; x[5] = 405 , x2[3] = 402).
Below is an input example (Picture 1 and Picture 2):
Pict1
Pict2
In this picture I need to identify 4 point pairs (Index pict1 / Index pict2):
6 / 9
7 / 8
17 / 13
20 / 14
Forewords
Your question is related to Nearest Neighbors Search (NNS).
One way to solve it is to build a spatial index like in Spatial Databases.
A straightforward solution is KD-Tree which is implemented in sklearn.
Questions
At this point it is essential to know what question we want to answer:
Q1.a) Find all points in dataset B that lie within a given distance threshold atol (radius) of each point of dataset A.
Or:
Q2.a) Find the k closest points in dataset B with respect to each point of dataset A.
Both questions can be answered using KD-Tree, what we must realise is:
Questions Q1 and Q2 are different, so are their answers;
Q1 can map 0 or more points together; there is no guarantee of a one-to-one mapping;
Q2 maps exactly 1 to k points; there is a guarantee that every point in the reference dataset is mapped to k points in the search dataset (provided there are enough points);
Q2.a is generally not equivalent to its reciprocal question Q2.b (when datasets A and B are permuted).
MCVE
Lets build a MCVE to address both questions:
import numpy as np
import pandas as pd
from sklearn.neighbors import KDTree
# Parameters
N = 50
atol = 50
keys = ['x', 'y']
# Trials Datasets (with different sizes, we keep it general):
df1 = pd.DataFrame(np.random.randint(0, 500, size=(N-5, 2)), columns=keys).reset_index()
df2 = pd.DataFrame(np.random.randint(0, 500, size=(N+5, 2)), columns=keys).reset_index()
# Spatial Index for Datasets:
kdt1 = KDTree(df1[keys].values, leaf_size=5, metric='euclidean')
kdt2 = KDTree(df2[keys].values, leaf_size=5, metric='euclidean')
# Answer Q2.a and Q2.b (searching for a single neighbour):
df1['kNN'] = kdt2.query(df1[keys].values, k=1, return_distance=False)[:,0]
df2['kNN'] = kdt1.query(df2[keys].values, k=1, return_distance=False)[:,0]
# Answer Q1.a and Q1.b (searching within a radius):
df1['radius'] = kdt2.query_radius(df1[keys].values, atol)
df2['radius'] = kdt1.query_radius(df2[keys].values, atol)
Below is the result for dataset A as reference:
index x y kNN radius
0 0 65 234 39 [39]
1 1 498 49 11 [11]
2 2 56 171 19 [29, 19]
3 3 239 43 20 [20]
4 4 347 32 50 [50]
[...]
At this point, we have everything required to spatially join our data.
Nearest Neighbors (k=1)
We can join our datasets using kNN index:
kNN1 = df1.merge(df2[['index'] + keys], left_on='kNN', right_on='index', suffixes=('_a', '_b'))
It returns:
index_a x_a y_a kNN radius index_b x_b y_b
0 0 65 234 39 [39] 39 49 260
1 1 498 49 11 [11] 11 487 4
2 2 56 171 19 [29, 19] 19 39 186
3 3 239 43 20 [20] 20 195 33
4 4 347 32 50 [50] 50 382 32
[...]
Graphically it leads to:
And the reciprocal question gives:
We see that the mapping is exactly 1-to-k=1: every point in the reference dataset is mapped to a point in the search dataset. But the answers differ when we swap the reference.
Radius atol
We can also join our datasets using the radius index:
rad1 = df1.explode('radius')\
          .merge(df2[['index'] + keys], left_on='radius', right_on='index',
                 suffixes=('_a', '_b'))
It returns:
index_a x_a y_a kNN radius index_b x_b y_b
0 0 65 234 39 39 39 49 260
2 1 498 49 11 11 11 487 4
3 2 56 171 19 29 29 86 167
4 2 56 171 19 19 19 39 186
7 3 239 43 20 20 20 195 33
[...]
Graphically:
The reciprocal answer is equivalent:
We see the answers are identical, but there is no guarantee of a one-to-one mapping. Some points are not mapped (lonely points), some are mapped to many points (dense neighbourhood). Additionally, it requires an extra parameter atol which must be tuned for a given context.
Bonus
Below is the function used to render the figures:
import matplotlib.pyplot as plt
def plot(A, B, join, title=''):
    X = join.loc[:,['x_a','x_b']].values
    Y = join.loc[:,['y_a','y_b']].values
    fig, axe = plt.subplots()
    axe.plot(A['x'], A['y'], 'x', label='Dataset A')
    axe.plot(B['x'], B['y'], 'x', label='Dataset B')
    for k in range(X.shape[0]):
        axe.plot(X[k,:], Y[k,:], linewidth=0.75, color='black')
    axe.set_title(title)
    axe.set_xlabel(r'$x$')
    axe.set_ylabel(r'$y$')
    axe.grid()
    axe.legend(bbox_to_anchor=(1,1), loc='upper left')
    return axe
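For example, the figures above could be produced with calls like these (illustrative usage, assuming the joins computed earlier):
plot(df1, df2, kNN1, title='Nearest Neighbours (k=1), A as reference')
plot(df1, df2, rad1, title='Radius search (atol), A as reference')
plt.show()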
References
Some useful references:
https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.random.randint.html
https://en.wikipedia.org/wiki/Nearest_neighbor_search
https://en.wikipedia.org/wiki/K-d_tree
https://scikit-learn.org/stable/modules/neighbors.html
https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KDTree.html#sklearn.neighbors.KDTree
https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KDTree.html#sklearn.neighbors.KDTree.query_radius
I have this series:
data:
0 17
1 25
2 10
3 60
4 0
5 20
6 300
7 50
8 10
9 80
10 100
11 65
12 125
13 50
14 100
15 150
Name: 1, dtype: int64
I wanted to plot a histogram with variable bin sizes, so I made this:
filter_values = [0,25,50,60,75,100,150,200,250,300,350]
out = pd.cut(data, bins=filter_values)
counts = pd.value_counts(out)
print(counts)
My problem is that when I use counts.plot(kind="hist"), I don't get the right labels on the x axis. I only get them by using a bar graph instead, counts.plot(kind="bar"), but then I can't get the right order.
I tried to use xticks=counts.index.values[0] but it raises an error, and xticks=filter_values gives an odd figure shape because the numbers go far beyond what the plot understands the bins to be.
I also tried counts.hist(), data.hist(), and counts.plot.hist() without success.
I don't know how to plot the categorical data in counts correctly (its index is a pandas CategoricalIndex), so I don't know which approach to take: whether there is a way to plot variable bins directly with data.hist(), data.plot(kind="hist") or data.plot.hist(), or whether I am right to build counts, and in that case how to represent it correctly (with good labels on the x axis and in the right order, not the descending order of the bar graph).
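One possible direction (just a sketch, not a definitive answer, reusing data and filter_values from above): keep the bin order by sorting counts on its categorical index before plotting a bar chart, since a histogram plot ignores precomputed counts:
import pandas as pd
import matplotlib.pyplot as plt
counts = pd.cut(data, bins=filter_values).value_counts().sort_index()
ax = counts.plot(kind="bar", rot=45)  # bars now follow the bin order
ax.set_xlabel("bin")
ax.set_ylabel("count")
plt.show()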
How can I find all the points (or the region) in an image that has 3 dimensions, where the first two dimensions give the resolution and the 3rd one the density? I can use MATLAB or Python. I wonder if there is a native function for finding those points that is least computationally expensive.
UPDATE:
Imagine I have the following:
A= [1,2,3; 4,6,6; 7,6,6]
A =
1 2 3
4 6 6
7 6 6
>> B=[7,8,9; 10,11,11; 1, 11,11]
B =
7 8 9
10 11 11
1 11 11
>> C=[0,1,2; 3, 7, 7; 5,7,7]
C =
0 1 2
3 7 7
5 7 7
How can I find the lower square in which all the values of A are equal to each other, all the values of B are equal to each other, and all the values of C are equal to each other? If this is too much, how can I find the lower square in A wherein all the values in A are equal?
*The shown values are the intensity of the image.
UPDATE: I tried the provided answer and got this error:
>> c=conv2(M,T, 'full');
Warning: CONV2 on values of class UINT8 is obsolete.
Use CONV2(DOUBLE(A),DOUBLE(B)) or CONV2(SINGLE(A),SINGLE(B)) instead.
> In uint8/conv2 (line 10)
Undefined function 'conv2' for input arguments of type 'double' and attributes 'full 3d real'.
Error in uint8/conv2 (line 17)
y = conv2(varargin{:});
*Also tried convn and it took forever so I just stopped it!
Basically, how can I do this for a 2D array as described above?
A possible solution:
A = [1,2,3; 4,6,6; 7,6,6];
B = [7,8,9; 10,11,11; 1, 11,11];
C = [0,1,2; 3, 7, 7; 5,7,7];
%create a 3D array
D = cat(3,A,B,C)
%reshape the 3D array to 2D
%its columns represent the third dimension
%and its rows represent resolution
E = reshape(D,[],size(D,3));
%the third output of the unique function applied row-wise to the data
%represents the label of each pixel; a [m*n, 1] vector is created
[~,~,F] = unique(E,'rows');
%reshape the vector to a [m, n] matrix of labels
result = reshape(F, size(D,1), size(D,2));
You can reshape the 3D matrix to a 2D matrix (E) whose columns represent the third dimension and whose rows represent the resolution.
Then, using the unique function, you can label the image.
We have a 3D matrix:
A =
1 2 3
4 6 6
7 6 6
B =
7 8 9
10 11 11
1 11 11
C =
0 1 2
3 7 7
5 7 7
When we reshape the 3D matrix to a 2D matrix E we get:
E =
1 7 0
4 10 3
7 1 5
2 8 1
6 11 7
6 11 7
3 9 2
6 11 7
6 11 7
So we need to classify the rows based on their values.
The unique function can extract unique rows and assign the same label to rows that are equal to each other.
Here the variable F captures the third output of the unique function, which is the label of each row.
F =
1
4
6
2
5
5
3
5
5
which is then reshaped back to 2D:
result =
1 2 3
4 5 5
6 5 5
so each region has a different label.
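Since the question mentions Python as an option too, here is a rough numpy sketch of the same labeling idea (illustrative only, not part of the original answer):
import numpy as np
A = np.array([[1,2,3],[4,6,6],[7,6,6]])
B = np.array([[7,8,9],[10,11,11],[1,11,11]])
C = np.array([[0,1,2],[3,7,7],[5,7,7]])
D = np.dstack([A, B, C])            # 3-D stack, shape (rows, cols, layers)
E = D.reshape(-1, D.shape[2])       # one row per pixel, one column per layer
_, F = np.unique(E, axis=0, return_inverse=True)
result = F.reshape(D.shape[:2])     # label image, analogous to the MATLAB result
print(result)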
If you want to segment distinct regions (based on both their values and their spatial positions), you need to label the image in a loop:
numcolors = max(F);
N = 0;
segment = zeros(size(result));
for c = 1 : numcolors
    [label,n] = bwlabel(result==c);
    segment = segment + label + logical(label)*N;
    N = N + n;
end
So here you need to mark disconnected regions that have the same values with different labels. Since MATLAB doesn't have a function for gray-level segmentation, you can apply the bwlabel function once per label and add the result of the previous iteration to the result of the current iteration. The segment variable contains the segmented image.
*Note: this result was obtained with GNU Octave, whose labeling differs from MATLAB's. If you use unique(E,'rows','last'); the results of MATLAB and Octave will be the same.
You can use a pair of horizontal and vertical 1D filters, where the horizontal filter has a kernel of [1 -1] and the vertical filter has a kernel of [1; -1]. The effect is to compute the pairwise differences between neighbouring elements, horizontally and vertically, for each channel separately. You can then perform image filtering or convolution with these two kernels, making sure to replicate the borders. Uniform regions are then the locations where both results are 0 in every channel.
To do this, first take the logical opposite of both filtering results, so that uniform locations that would be 0 become 1 and vice-versa. Then perform a logical AND of the two results, and finally require that, for each pixel, the condition holds across all channels. This means that at such a spatial location, every channel is uniform, as you expect.
In MATLAB, assuming you have the Image Processing Toolbox, use imfilter to filter the images, then use all to check the combined condition across the third dimension, and finally use regionprops to find the coordinates of the regions you seek. So do something like this:
%# Reproducing your data
A = [1,2,3; 4,6,6; 7,6,6];
B = [7,8,9; 10,11,11; 1, 11,11];
C = [0,1,2; 3, 7, 7; 5,7,7];
%# Create a 3D matrix to allow for efficient filtering
D = cat(3, A, B, C);
%# Filter using the kernels
ker = [1 -1];
ker2 = ker.'; %# Vertical kernel (transpose of the horizontal one)
out = imfilter(D, ker, 'replicate');
out2 = imfilter(D, ker2, 'replicate');
%# Find uniform regions
regions = all(~out & ~out2, 3);
%# Determine the locations of the uniform areas
R = regionprops(regions, 'BoundingBox');
%# Round to ensure pixel accuracy and reshape into a matrix
coords = round(reshape([R.BoundingBox], 4, [])).';
coords would be a N x 4 matrix with each row telling the upper-left coordinates of the bounding box origin as well as the width and height of the bounding box. The first and second elements in a row are the column and row coordinate while the third and fourth elements are the width and height of the bounding box.
The regions we have detected can be found in the regions variable. Both of these show:
>> regions
regions =
3×3 logical array
0 0 0
0 1 1
0 1 1
>> coords
coords =
2 2 2 2
This tells us that we have localised the region of "uniformity" to be the bottom right corner while the coordinates of the top-left corner of the bounding box are row 2, column 2 with a width and height of 2 and 2 respectively.
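If you would rather do the same thing in Python (the question allows either), a rough counterpart of the idea could look like the following sketch (scipy is an assumed dependency and the variable names are illustrative):
import numpy as np
from scipy import ndimage
A = np.array([[1,2,3],[4,6,6],[7,6,6]])
B = np.array([[7,8,9],[10,11,11],[1,11,11]])
C = np.array([[0,1,2],[3,7,7],[5,7,7]])
D = np.dstack([A, B, C])
# differences with the right and bottom neighbours, replicating the border
dh = np.diff(D, axis=1, append=D[:, -1:, :])
dv = np.diff(D, axis=0, append=D[-1:, :, :])
# a pixel is uniform if both differences are zero in every layer
regions = np.all((dh == 0) & (dv == 0), axis=2)
# bounding boxes of the connected uniform areas
labels, n = ndimage.label(regions)
print(ndimage.find_objects(labels))  # [(slice(1, 3), slice(1, 3))] for this example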
Check out https://docs.scipy.org/doc/scipy-0.15.1/reference/generated/scipy.signal.correlate2d.html
2D correlation basically "slides" the two images across each other, and adds up the dot product of the overlap.
More reading: http://www.cs.umd.edu/~djacobs/CMSC426/Convolution.pdf
https://en.wikipedia.org/wiki/Two-dimensional_correlation_analysis
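To make the sliding idea concrete, here is a tiny illustration with made-up values (just a sketch):
import numpy as np
from scipy.signal import correlate2d
image = np.array([[0, 1, 1],
                  [0, 1, 1],
                  [0, 0, 0]])
template = np.ones((2, 2))
# the peak of the correlation marks where the template overlaps the image best
print(correlate2d(image, template, mode='valid'))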