How to join dataset on coordinates? - python

I am analysing two datasets and need to compare them. Each dataset has an index and coordinates (X, Y). The coordinates are not equal, so I need to use something like the numpy.isclose function (e.g. atol=5).
My aim in the comparison is to find similar y coordinates (e.g. y[5] = 101 in dataset 1, y2[7] = 103 in dataset 2), and then to compare the x coordinates at the same indices (e.g. x[5] = 405 in dataset 1, x2[7] = 401 in dataset 2).
My problem is that I can't combine these two isclose tests.
I have tried comparing first the y and then the x coordinates, but as separate comparisons the function also matches unrelated data (e.g. y[5] = 101, y2[7] = 103; x[5] = 405, x[3] = 402). It needs to compare the same indices in both tests (index 5 paired with index 7 for both x and y).
This is working but gives wrong results:
yres = {i for i in yvar if numpy.isclose(yvar2, i, atol= 5).any()}
xres = {i for i in xvar if numpy.isclose(xvar2, i, atol= 5).any()}
Theoretically I am searching for something like this (pseudocode; a working broadcasting version is sketched after the expected pairs below):
yres = {i for i in yvar if numpy.isclose(yvar2, i, atol= 5).any() & i for i in xvar if numpy.isclose(xvar2, i, atol= 5).any()}
I expect to find points with similar coordinates
(e.g. y[5] = 101, y2[7] = 103; x[5] = 405, x2[7] = 401).
At the moment I get matches between any similar values
(e.g. y[5] = 101, y2[7] = 103; x[5] = 405, x2[3] = 402).
Below is an input example (Picture 1 and Picture 2):
[Pict1: screenshot of dataset 1]
[Pict2: screenshot of dataset 2]
In these pictures I need to identify 4 point pairs (index in Pict1 / index in Pict2):
6 / 9
7 / 8
17 / 13
20 / 14
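For reference, here is what that combined test can look like as valid NumPy, using broadcasting so both axes are compared for the same index pairs (the sample arrays below are hypothetical stand-ins for the real datasets):
import numpy as np

# Hypothetical stand-ins for the question's coordinate arrays
xvar  = np.array([405, 120, 300])
yvar  = np.array([101,  50, 220])
xvar2 = np.array([900, 401,  55])
yvar2 = np.array([ 10, 103, 220])

# Entry (i, j) is True only when point i of dataset 1 and point j of
# dataset 2 are close in BOTH axes, so the index pairing stays consistent.
close = (np.isclose(yvar[:, None], yvar2[None, :], atol=5)
         & np.isclose(xvar[:, None], xvar2[None, :], atol=5))

pairs = np.argwhere(close)  # rows of (index in dataset 1, index in dataset 2)
print(pairs)                # [[0 1]] -> y[0]/x[0] pairs with y2[1]/x2[1]
This builds the full comparison matrix, which is fine for small datasets; the answer below scales better through a spatial index.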

Foreword
Your question is related to Nearest Neighbors Search (NNS).
One way to solve it is to build a spatial index like in Spatial Databases.
A straightforward solution is KD-Tree which is implemented in sklearn.
Questions
At this point it is essential to know what question we want to answer:
Q1.a) Find all points in dataset B that are within a given distance threshold atol (a radius) of points of dataset A.
Or:
Q2.a) Find the k closest points in dataset B with respect to each point of dataset A.
Both questions can be answered using a KD-Tree; what we must realise is:
Questions Q1 and Q2 are different, and so are their answers;
Q1 can map 0 or more points together; there is no guarantee of a one-to-one mapping;
Q2 maps exactly 1 to k points; there is a guarantee that all points in the reference dataset are mapped to k points in the search dataset (provided there are enough points);
Q2.a is generally not equivalent to its reciprocal question Q2.b (where datasets A and B are swapped).
MCVE
Let's build an MCVE to address both questions:
import numpy as np
import pandas as pd
from sklearn.neighbors import KDTree

# Parameters
N = 50
atol = 50
keys = ['x', 'y']
# Trial datasets (with different sizes, to keep it general):
df1 = pd.DataFrame(np.random.randint(0, 500, size=(N-5, 2)), columns=keys).reset_index()
df2 = pd.DataFrame(np.random.randint(0, 500, size=(N+5, 2)), columns=keys).reset_index()
# Spatial Index for Datasets:
kdt1 = KDTree(df1[keys].values, leaf_size=5, metric='euclidean')
kdt2 = KDTree(df2[keys].values, leaf_size=5, metric='euclidean')
# Answer Q2.a and Q2.b (searching for a single neighbour):
df1['kNN'] = kdt2.query(df1[keys].values, k=1, return_distance=False)[:,0]
df2['kNN'] = kdt1.query(df2[keys].values, k=1, return_distance=False)[:,0]
# Answer Q1.a and Q1.b (searching within a radius):
df1['radius'] = kdt2.query_radius(df1[keys].values, atol)
df2['radius'] = kdt1.query_radius(df2[keys].values, atol)
Below is the result with dataset A as reference:
index x y kNN radius
0 0 65 234 39 [39]
1 1 498 49 11 [11]
2 2 56 171 19 [29, 19]
3 3 239 43 20 [20]
4 4 347 32 50 [50]
[...]
At this point, we have everything required to spatially join our data.
Nearest Neighbors (k=1)
We can join our datasets using the kNN index:
kNN1 = df1.merge(df2[['index'] + keys], left_on='kNN', right_on='index', suffixes=('_a', '_b'))
It returns:
index_a x_a y_a kNN radius index_b x_b y_b
0 0 65 234 39 [39] 39 49 260
1 1 498 49 11 [11] 11 487 4
2 2 56 171 19 [29, 19] 19 39 186
3 3 239 43 20 [20] 20 195 33
4 4 347 32 50 [50] 50 382 32
[...]
Graphically it leads to:
[Figure: each point of A joined by a line to its nearest neighbour in B]
And the reciprocal question gives:
[Figure: each point of B joined by a line to its nearest neighbour in A]
We see that the mapping is exactly 1-to-k (k=1): every point in the reference dataset is mapped to a point in the search dataset. But the answers differ when we swap the reference.
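Since Q2.a and Q2.b give different answers, one common way to extract a symmetric, one-to-one pairing is to keep only mutual nearest neighbours. A minimal sketch on top of the kNN columns computed above (not part of the original answer):
# Map each point of df1 to its neighbour in df2 and back again;
# keep only the rows that round-trip to themselves.
back = df2.loc[df1['kNN'], 'kNN'].values
mutual = df1[back == df1.index.values]
Points surviving this filter are each other's closest match in both directions, which is usually what a point-pairing task expects.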
Radius atol
We can also join our datasets using the radius index:
rad1 = df1.explode('radius')\
    .merge(df2[['index'] + keys], left_on='radius', right_on='index',
           suffixes=('_a', '_b'))
It returns:
index_a x_a y_a kNN radius index_b x_b y_b
0 0 65 234 39 39 39 49 260
2 1 498 49 11 11 11 487 4
3 2 56 171 19 29 29 86 167
4 2 56 171 19 19 19 39 186
7 3 239 43 20 20 20 195 33
[...]
Graphically:
[Figure: each point of A joined to all points of B within the radius atol]
The reciprocal answer is equivalent:
[Figure: each point of B joined to all points of A within the radius atol]
We see the answers are identical, but there is no guarantee of a one-to-one mapping. Some points are not mapped at all (lonely points), some are mapped to many points (dense neighbourhoods). Additionally, it requires an extra parameter atol which must be tuned for the given context.
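Note also that the question's numpy.isclose test with a per-axis atol describes a square box around each point, whereas the euclidean radius above describes a disc. If the box semantics matter, the KD-Tree can be rebuilt with the Chebyshev metric, which measures the maximum per-axis difference (a sketch reusing the MCVE's variables):
# Chebyshev distance <= atol means both |dx| <= atol and |dy| <= atol,
# which is exactly the combined isclose condition from the question.
kdt2_box = KDTree(df2[keys].values, leaf_size=5, metric='chebyshev')
df1['box'] = kdt2_box.query_radius(df1[keys].values, r=atol)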
Bonus
Below is the function used to render the figures:
import matplotlib.pyplot as plt

def plot(A, B, join, title=''):
    X = join.loc[:, ['x_a', 'x_b']].values
    Y = join.loc[:, ['y_a', 'y_b']].values
    fig, axe = plt.subplots()
    axe.plot(A['x'], A['y'], 'x', label='Dataset A')
    axe.plot(B['x'], B['y'], 'x', label='Dataset B')
    for k in range(X.shape[0]):
        axe.plot(X[k, :], Y[k, :], linewidth=0.75, color='black')
    axe.set_title(title)
    axe.set_xlabel(r'$x$')
    axe.set_ylabel(r'$y$')
    axe.grid()
    axe.legend(bbox_to_anchor=(1, 1), loc='upper left')
    return axe
References
Some useful references:
https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.random.randint.html
https://en.wikipedia.org/wiki/Nearest_neighbor_search
https://en.wikipedia.org/wiki/K-d_tree
https://scikit-learn.org/stable/modules/neighbors.html
https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KDTree.html#sklearn.neighbors.KDTree
https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KDTree.html#sklearn.neighbors.KDTree.query_radius

Related

Group a column such that its sum is approximately equal in each group

What's the easiest way to sort evenly distributed values into a predefined number of groups?
data = {'impact':[10,30,20,10,90,60,50,40]}
df = pd.DataFrame(data,index=['a','b','c','d','e','f','g','h'])
print df
impact
a 10
b 30
c 20
d 10
e 90
f 60
g 50
h 40
numgroups = 4
group_targetsum = round(df.impact.sum() / numgroups, -1)
print group_targetsum
80.0
In the case above, I'd like to create 4 groups from df. The only sorting criterion is that the sum of impact in each group should be approximately equal to group_targetsum. The impact sum can be above or below group_targetsum within a reasonable margin.
Ultimately, I'd like to separate these groups into their own dataframes, preserving index. Resulting in something like this:
print df_a
impact
e 90
print df_b
impact
c 20
f 60
print df_c
impact
a 10
d 10
g 50
print df_d
impact
b 30
h 40
Resulting dataframes don't need to be exactly this, just as long as they sum as close as possible to group_targetsum.
Assuming fairly similar values in the series, here's an approach using searchsorted -
In [150]: df
Out[150]:
impact
a 10
b 30
c 20
d 10
e 90
f 60
g 50
h 40
In [151]: a = df.values.ravel()
In [152]: shift_num = group_targetsum*np.arange(1,numgroups)
In [153]: idx = np.searchsorted(a.cumsum(), shift_num,'right')
In [154]: np.split(a, idx)
Out[154]: [array([10, 30, 20, 10]), array([90]), array([60]), array([50, 40])]
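To recover per-group DataFrames with the index preserved, as the question asks, the same cut positions can slice the frame itself (a small follow-up sketch, not part of the original answer):
parts = np.split(np.arange(len(df)), idx)   # row positions, split at the same cut points
dfs = [df.iloc[p] for p in parts]           # one DataFrame per group, index preserved
dfs[1]
#    impact
# e      90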
Conceptually we'd just like to use a weighted version of qcut, but that doesn't exist in pandas at this time. Nevertheless, we can accomplish the same thing by combining cumsum and cut. The cumsum essentially gives us the weighting, and we then slice it up with cut.
(Note about 'csum_midpoint': without the midpoint adjustment, we'd end up putting each row into a group based on where it begins (in a cumulative sense), and hence end up with a bias towards binning in the higher groups. The midpoint adjustment can't make things perfectly even, but it helps. I believe this answer is mathematically the same as Divakar's, with the exception of my use of the midpoint here and his use of 'right'.)
df['csum'] = df['impact'].cumsum()
df['csum_midpoint'] = (df.csum + df.csum.shift().fillna(0)) / 2.
df['grp'] = pd.cut(df.csum_midpoint, np.linspace(0, df['impact'].sum(), numgroups + 1))
df.groupby(df.grp)['impact'].sum()
grp
(0, 77.5] 70
(77.5, 155] 90
(155, 232.5] 60
(232.5, 310] 90
Name: impact, dtype: int64
df
impact csum csum_midpoint grp
a 10 10 5.0 (0, 77.5]
b 30 40 25.0 (0, 77.5]
c 20 60 50.0 (0, 77.5]
d 10 70 65.0 (0, 77.5]
e 90 160 115.0 (77.5, 155]
f 60 220 190.0 (155, 232.5]
g 50 270 245.0 (232.5, 310]
h 40 310 290.0 (232.5, 310]
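To pull these out into the separate frames the question asks for (a short follow-up sketch):
# One DataFrame per bin, original index preserved
frames = {name: grp[['impact']] for name, grp in df.groupby('grp')}
for name, frame in frames.items():
    print(frame)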

How to check correlation between matching columns of two data sets?

If we have the data set:
import pandas as pd
a = pd.DataFrame({"A":[34,12,78,84,26], "B":[54,87,35,25,82], "C":[56,78,0,14,13], "D":[0,23,72,56,14], "E":[78,12,31,0,34]})
b = pd.DataFrame({"A":[45,24,65,65,65], "B":[45,87,65,52,12], "C":[98,52,32,32,12], "D":[0,23,1,365,53], "E":[24,12,65,3,65]})
How does one create a correlation matrix, in which the y-axis represents "a" and the x-axis represents "b"?
The aim is to see the correlations between the matching columns of the two datasets, like this:
[Image: desired correlation matrix, omitted]
If you don't mind a NumPy-based vectorized solution, here's one based on the solution post Computing the correlation coefficient between two multi-dimensional arrays -
corr2_coeff(a.values.T,b.values.T).T # func from linked solution post.
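For reference, the helper from that post is along these lines (reproduced here as a sketch; see the linked post for the original):
import numpy as np

def corr2_coeff(A, B):
    # Row-wise mean-centre both inputs
    A_mA = A - A.mean(1)[:, None]
    B_mB = B - B.mean(1)[:, None]
    # Sum of squares per row
    ssA = (A_mA ** 2).sum(1)
    ssB = (B_mB ** 2).sum(1)
    # Correlation of every row of A with every row of B
    return np.dot(A_mA, B_mB.T) / np.sqrt(np.outer(ssA, ssB))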
Sample run -
In [621]: a
Out[621]:
A B C D E
0 34 54 56 0 78
1 12 87 78 23 12
2 78 35 0 72 31
3 84 25 14 56 0
4 26 82 13 14 34
In [622]: b
Out[622]:
A B C D E
0 45 45 98 0 24
1 24 87 52 23 12
2 65 65 32 1 65
3 65 52 32 365 3
4 65 12 12 53 65
In [623]: corr2_coeff(a.values.T,b.values.T).T
Out[623]:
array([[ 0.71318502, -0.5923714 , -0.9704441 , 0.48775228, -0.07401011],
[ 0.0306753 , -0.0705457 , 0.48801177, 0.34685977, -0.33942737],
[-0.26626431, -0.01983468, 0.66110713, -0.50872017, 0.68350413],
[ 0.58095645, -0.55231196, -0.32053858, 0.38416478, -0.62403866],
[ 0.01652716, 0.14000468, -0.58238879, 0.12936016, 0.28602349]])
This achieves exactly what you want:
from scipy.stats import pearsonr
# create a new DataFrame where the values for the indices and columns
# align on the diagonals
c = pd.DataFrame(columns=a.columns, index=a.columns)
# since we know set(a.columns) == set(b.columns), we can just iterate
# through the columns in a (although a more robust way would be to iterate
# through the intersection of the two sets of columns, in case your actual
# dataframes' columns don't match up)
for col in a.columns:
    correl_signif = pearsonr(a[col], b[col])  # correlation of those two Series
    correl = correl_signif[0]  # grab the actual Pearson r value from the tuple above
    c.loc[col, col] = correl  # locate the diagonal for that column and assign the correlation coefficient
Edit: Well, it achieved exactly what you wanted, until the question was modified. Although this can easily be changed:
c = pd.DataFrame(columns=a.columns, index=a.columns)
for col in c.columns:
    for idx in c.index:
        correl_signif = pearsonr(a[col], b[idx])
        correl = correl_signif[0]
        c.loc[idx, col] = correl
c is now this:
Out[16]:
A B C D E
A 0.713185 -0.592371 -0.970444 0.487752 -0.0740101
B 0.0306753 -0.0705457 0.488012 0.34686 -0.339427
C -0.266264 -0.0198347 0.661107 -0.50872 0.683504
D 0.580956 -0.552312 -0.320539 0.384165 -0.624039
E 0.0165272 0.140005 -0.582389 0.12936 0.286023
I use this function, which breaks it down with numpy:
def corr_ab(a, b):
    a_ = a.values
    b_ = b.values
    ab = a_.T.dot(b_)
    n = len(a)
    sums_squared = np.outer(a_.sum(0), b_.sum(0))
    stds_squared = np.outer(a_.std(0), b_.std(0))
    return pd.DataFrame((ab - sums_squared / n) / stds_squared / n,
                        a.columns, b.columns)
demo
corr_ab(a, b)
Do you have to use Pandas? This seems doable via numpy rather easily. Did I understand the task incorrectly?
import numpy
X = {"A":[34,12,78,84,26], "B":[54,87,35,25,82], "C":[56,78,0,14,13], "D":[0,23,72,56,14], "E":[78,12,31,0,34]}
Y = {"A":[45,24,65,65,65], "B":[45,87,65,52,12], "C":[98,52,32,32,12], "D":[0,23,1,365,53], "E":[24,12,65,3,65]}
for key, value in X.items():
    print "correlation stats for %s is %s" % (key, numpy.corrcoef(value, Y[key]))

Filling in missing data in Python

I was hoping you would be able to help me solve a small problem.
I am using a small device that prints out two properties that I save to a file. The device rasters in X and Y direction to form a grid. I am interested in plotting the relative intensity of these two properties as a function of the X and Y dimensions. I record the data in 4 columns that are comma separated (X, Y, property 1, property 2).
The grid is examined in lines, so for each Y value, it will move from X1 to X2 which are separated several millimeters apart. Then it will move to the next line and over again.
I am able to process the data in python with pandas/numpy but it doesn't work too well when there are any missing rows (which unfortunately does happen).
I have attached a sample of the output (and annotated the problems):
44,11,500,1
45,11,120,2
46,11,320,3
47,11,700,4
New << used as my Y axis separator
44,12,50,5
45,12,100,6
46,12,1500,7
47,12,2500,8
Sometimes, however a line or a few will be missing making it not possible to process and plot. Currently I have not been able to automatically fix it and have to do it manually. The bad output looks like this:
44,11,500,1
45,11,120,2
46,11,320,3
47,11,700,4
New << used as my Y axis separator
45,12,100,5 << missing 44,12...
46,12,1500,6
47,12,2500,7
I know the number of lines I expect since I know my range of X and Y.
What would be the best way to deal with this? Currently I manually enter the missing X and Y values and populate property 1 and 2 with values of 0. This can be time consuming and I would like to automate it. I have two questions.
Question 1: How can I automatically fill in my missing data with the corresponding values of X and Y and two zeros? This could be obtained from a pre-generated array of X and Y values that correspond to the experimental range.
Question 2: Is there a better way to split the file into separate arrays for plotting (rather than using the 'New' line)? For instance, by having an 'if' condition that will output each line between X(start) and X(end) to a separate array? I've tried doing that but with no success.
I've attached my current (crude) code:
df = pd.read_csv('FileName.csv', delimiter=',', skiprows=0)
rows = [-1] + np.where(df['X']=='New')[0].tolist() + [len(df.index)]
dff = {}
for i, r in enumerate(rows[:-1]):
    dff[i] = df[r+1: rows[i+1]]
maxY = len(dff)
data = []
data2 = []
for yaxes in range(0, maxY):
    data2.append(dff[yaxes].ix[:, 2])
<data2 is then used for plotting using matplotlib>
To answer my Question 1, I was thinking about using the 'reindex' and 'reset_index' functions, but I haven't managed to make them work.
I would appreciate any suggestions.
Does this meet what you want?
Q1: fill X using reindex, and others using fillna
Q2: Passing separate StringIO objects to read_csv is easier (adjust the unicode call if you use Python 3)
import pandas as pd
import numpy as np
from StringIO import StringIO  # Python 2; use io.StringIO on Python 3

# read file and split the input
f = open('temp.csv', 'r')
chunks = f.read().split('New')
# read csv as separate dataframes, using the first column as index
dfs = [pd.read_csv(StringIO(unicode(chunk)), header=None, index_col=0) for chunk in chunks]

def pad(df):
    # reindex; you should know the range of x
    df = df.reindex(np.arange(44, 48))
    # pad y backward / forward, assuming y has a single value per chunk
    df[1] = df[1].fillna(method='bfill')
    df[1] = df[1].fillna(method='ffill')
    # pad the remaining columns with zeros
    df = df.fillna(0)
    # revert the index to values
    return df.reset_index(drop=False)

dfs = [pad(df) for df in dfs]
dfs[0]
# 0 1 2 3
# 0 44 11 500 1
# 1 45 11 120 2
# 2 46 11 320 3
# 3 47 11 700 4
# dfs[1]
# 0 1 2 3
# 0 44 12 0 0
# 1 45 12 100 5
# 2 46 12 1500 6
# 3 47 12 2500 7
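For the plotting step, each padded chunk can then be drawn directly; a hedged sketch, assuming as above that column 0 is X, column 1 is Y and column 2 is property 1:
import matplotlib.pyplot as plt

# One curve per scan line, labelled by its Y value
for chunk in dfs:
    plt.plot(chunk[0], chunk[2], label='y = %d' % chunk[1].iloc[0])
plt.legend()
plt.show()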
First Question
I've included print statements inside the function to explain how it works:
In [89]:
def replace_missing(df, Ids):
    # check which values are missing
    missing = np.setdiff1d(Ids, df[0])
    if len(missing) > 0:
        missing_df = pd.DataFrame(data=np.zeros((len(missing), 4)))
        #print('---missing df---')
        #print(missing_df)
        missing_df[0] = missing
        #print('---missing df---')
        #print(missing_df)
        missing_df[1].replace(0, df[1].iloc[0], inplace=True)
        #print('---missing df---')
        #print(missing_df)
        df = pd.concat([df, missing_df])
        #print('---final df---')
        #print(df)
    return df
In [91]:
Ids = np.arange(44,48)
final_df = df1.groupby(df1[1] , as_index = False).apply(replace_missing , Ids).reset_index(drop = True)
final_df
Out[91]:
    0   1     2  3
0  44  11   500  1
1  45  11   120  2
2  46  11   320  3
3  47  11   700  4
4  45  12   100  5
5  46  12  1500  6
6  47  12  2500  7
7  44  12     0  0
Second question
In [92]:
group = final_df.groupby(final_df[1])
In [99]:
separate = [group.get_group(key) for key in group.groups.keys()]
separate[0]
Out[104]:
0 1 2 3
44 11 500 1
45 11 120 2
46 11 320 3
47 11 700 4

Trouble with numpy.histogram2d

I'm trying to see if numpy.histogram2d will cross tabulate data in 2 arrays for me. I've never used this function before and I'm getting an error I don't know how to fix.
import numpy as np
import random
zones = np.zeros((20,30), int)
values = np.zeros((20,30), int)
for i in range(20):
    for j in range(30):
        values[i,j] = random.randint(0,10)
zones[:8,:15] = 100
zones[8:,:15] = 101
zones[:8,15:] = 102
zones[8:,15:] = 103
np.histogram2d(zones,values)
This code results in the following error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-18-53447df32000> in <module>()
----> 1 np.histogram2d(zones,values)
C:\Python27\ArcGISx6410.2\lib\site-packages\numpy\lib\twodim_base.pyc in histogram2d(x, y, bins, range, normed, weights)
613 xedges = yedges = asarray(bins, float)
614 bins = [xedges, yedges]
--> 615 hist, edges = histogramdd([x,y], bins, range, normed, weights)
616 return hist, edges[0], edges[1]
617
C:\Python27\ArcGISx6410.2\lib\site-packages\numpy\lib\function_base.pyc in histogramdd(sample, bins, range, normed, weights)
279 # Sample is a sequence of 1D arrays.
280 sample = atleast_2d(sample).T
--> 281 N, D = sample.shape
282
283 nbin = empty(D, int)
ValueError: too many values to unpack
Here is what I am trying to accomplish:
I have 2 arrays. One array comes from a geographic dataset (raster) representing Landcover classes (e.g. 1=Tree, 2=Grass, 3=Building, etc.). The other array comes from a geographic dataset (raster) representing some sort of political boundary (e.g. parcels, census blocks, towns, etc). I am trying to get a table that lists each unique political boundary area (array values represent a unique id) as rows and the total number of pixels within each boundary for each landcover class as columns.
I'm assuming values is the landcover and zones is the political boundaries. You might want to use np.bincount, which is like a special histogram where each bin has spacing and width of exactly one.
import numpy as np
zones = np.zeros((20,30), int)
zones[:8,:15] = 100
zones[8:,:15] = 101
zones[:8,15:] = 102
zones[8:,15:] = 103
values = np.random.randint(0,10,(20,30)) # no need for that loop
tab = np.array([np.bincount(values[zones==zone]) for zone in np.unique(zones)])
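One caveat: np.bincount sizes each result by the largest value present in that zone, so the rows can come out with different lengths, and np.array would then build a ragged object array rather than a 2D table. Passing minlength keeps the table rectangular (a small hedged tweak):
# Force every row to the same width so the result is a proper 2D table
nvals = values.max() + 1
tab = np.array([np.bincount(values[zones == zone], minlength=nvals)
                for zone in np.unique(zones)])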
You can do this more simply with histogram, though, if you are careful with the bin edges:
np.histogram2d(zones.flatten(), values.flatten(), bins=[np.unique(zones).size, values.max()-values.min()+1])
The way this works is as follows. The easiest example is to look at all values regardless of zone:
np.bincount(values)
Which gives you one row with the counts for each value (0 to 10). The next step is to look at the zones.
For one zone, you'd have just one row, and it would be:
zone = 101 # the desired zone
mask = zone==zones # a mask that is True wherever your zones map matches the desired zone
np.bincount(values[mask]) # count the values where the mask is True
Now, we just want to do this for each zone in the map. You can get a list of the unique values in your zones map with
zs = np.unique(zones)
and loop through it with a list comprehension, where each item is one of the rows as above:
tab = np.array([np.bincount(values[zones==zone]) for zone in np.unique(zones)])
Then, your table looks like this:
print tab
# elements with cover =
# 0 1 2 3 4 5 6 7 8 9 # in zone:
[[16 11 10 12 13 15 11 7 13 12] # 100
[13 23 15 16 24 16 24 21 15 13] # 101
[10 12 23 13 12 11 11 5 11 12] # 102
[19 25 20 12 16 19 13 18 22 16]] # 103
Finally, you can plot this in matplotlib as so:
import matplotlib.pyplot as plt
plt.hist2d(zones.flatten(), values.flatten(), bins=[np.unique(zones).size, values.max()-values.min()+1])
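Since the goal is literally a cross tabulation, pandas also offers this directly (an alternative sketch, not from the original answer):
import pandas as pd

# Rows are zone ids, columns are landcover classes, cells are pixel counts
table = pd.crosstab(zones.ravel(), values.ravel(),
                    rownames=['zone'], colnames=['class'])
print(table)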
histogram2d expects 1D arrays as input, and your zones and values are 2D. You could linearize them with ravel:
np.histogram2d(zones.ravel(), values.ravel())
If efficiency isn't a concern, I think this works for what you want to do
from collections import Counter
c = Counter(zip(zones.flat[:], landcover_classes.flat[:]))
c will contain key/value pairs where each key is a (zone, landcover class) tuple and each value is the count. You can populate an array if you like with
for (i, j), count in c.items():
    my_table[i, j] = count
That only works, of course, if i and j are sequential integers starting at zero (i.e., from 0 to Ni and 0 to Nj).
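If the ids are not sequential from zero, np.unique with return_inverse=True maps them onto compact positions first; a sketch using the question's zones and values arrays:
# Map arbitrary ids (e.g. zones 100..103) onto positions 0..n-1, then count
zone_ids, zi = np.unique(zones.ravel(), return_inverse=True)
class_ids, ci = np.unique(values.ravel(), return_inverse=True)
table = np.zeros((zone_ids.size, class_ids.size), int)
np.add.at(table, (zi, ci), 1)  # accumulate one count per (zone, class) pair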

Python 3.1- Grid Simulation Conceptual issue

The goal is to treat a 1D array as a 2D grid. A second 1D array gives a list of values that need to be changed in the grid, and a third array indicates by how much.
The catch is that the values surrounding a modified value are changed as well.
The example below stays as a 1D array, but makes calculations on it as if it were a 2D grid. It works, but currently it changes all values in the grid that match a value in the 1D list (samples). I want to convert only one value and its surroundings, for each value in the list.
i.e. If the list is [2,3], I only want to change the first 2 and the first 3 that the iteration comes across. The example at the moment alters every 2 in the grid.
What is confusing me is that (probably because of the way I have structured the modifying calculations) I can't simply iterate through the grid and remove a list value each time it matches.
Thank you in advance for your time!!
The code is following;
import numpy

def grid_range(value):
    if value > 60000:
        value = 60000
        return value
    elif value < 100:
        value = 100
        return value
    elif value <= 60000 and value >= 100:
        return value

def grid(array, samples, details):
    original_length = len(array)
    c = int((original_length)**0.5)
    new_array = []  # create a new array with the modified values
    for elem in range(len(array)):
        if array[elem] in samples:  # the value itself is in samples
            value = array[elem] + (array[elem] * (details[1]/100))
            test_range = grid_range(value)
            new_array.append(test_range)
        elif ((elem + 1) < original_length) and array[elem - 1] in samples:  # the element after a sample value
            if (len(new_array) % c == 0) and array[elem + 1] not in samples:
                new_array.append(array[elem])
            else:
                new_forward_element = array[elem] + (array[elem] * (details[2]/100))
                test_range1 = grid_range(new_forward_element)
                new_array.append(test_range1)
        elif ((elem + 1) < original_length) and (array[elem + 1]) in samples:  # the element before a sample value; the guard keeps it from reading past the end of the array
            if (len(new_array) + 1) % c == 0:
                new_array.append(array[elem])
            else:
                new_back_element = array[elem] + (array[elem] * (details[2]/100))
                test_range2 = grid_range(new_back_element)
                new_array.append(test_range2)
        elif ((elem + c) <= (original_length - c)) and (array[elem + c]) in samples:  # based on the numeric keypad layout with the test value at '5'; this is position '2'
            extra1 = array[elem] + (array[elem] * (details[2]/100))
            test_range3 = grid_range(extra1)
            new_array.append(test_range3)
        elif (array[abs(elem - c)]) in samples:  # position '8'
            extra2 = array[elem] + (array[elem] * (details[2]/100))
            test_range4 = grid_range(extra2)
            new_array.append(test_range4)
        elif (array[abs(elem - (c-1))]) in samples:  # position '7'
            if (elem - (c-1)) % c == 0:
                new_array.append(array[elem])
            else:
                extra3 = array[elem] + (array[elem] * (details[2]/100))
                test_range5 = grid_range(extra3)
                new_array.append(test_range5)
        elif (array[abs(elem - (c+1))]) in samples:  # position '9'
            if (elem - (c+1) + 1) % c == 0:
                new_array.append(array[elem])
            else:
                extra4 = array[elem] + (array[elem] * (details[2]/100))
                test_range6 = grid_range(extra4)
                new_array.append(test_range6)
        elif ((elem + (c-1)) < original_length) and (array[elem + (c-1)]) in samples:  # position '1', also not past the total array length
            if (elem + (c-1) + 1) % c == 0:
                new_array.append(array[elem])
            else:
                extra5 = array[elem] + (array[elem] * (details[2]/100))
                test_range7 = grid_range(extra5)
                new_array.append(test_range7)
        elif (elem + (c+1)) < (len(array) - c) and (array[elem + (c+1)]) in samples:  # position '3', also not past the total array length
            if (elem + (c+1)) % c == 0:
                new_array.append(array[elem])
            else:
                extra6 = array[elem] + (array[elem] * (details[2]/100))
                test_range8 = grid_range(extra6)
                new_array.append(test_range8)
        else:
            new_array.append(array[elem])
    return new_array
a = [16,2,20,4,14,6,70,8,9,100,32,15,7,14,50,20,17,10,9,20,7,17,50,2,19,20]
samples = [2]
grid_details = [10,50,100]
result = grid(a,samples,grid_details)
EDIT:
Based on your answer Joe, I have created a version which modifies the main value (centre) by a specific % and the surrounding elements by another. However, how do I ensure that the changed values are not converted again during the next iteration over samples?
Thank you for your time!
Example code:
import numpy as np

def grid(array, samples, details):
    # Side of the square (assumes len(array) is a perfect square)
    Width = int(len(array) ** 0.5)
    # Convert to grid (float dtype so the percentage maths isn't truncated)
    Converted = np.asarray(array, dtype=float).reshape(Width, Width)
    # Conversion details
    Change = [details[1]] + [details[2]]
    nrows, ncols = Converted.shape
    for value in samples:
        # First instance where indexing returns it
        i, j = np.argwhere(Converted == value)[0]
        # Prevent indexing outside the boundaries of the
        # array which would cause a "wraparound" assignment
        istart, istop = max(i-1, 0), min(i+2, nrows)
        jstart, jstop = max(j-1, 0), min(j+2, ncols)
        # Scale every value within the 3x3 window (the original line had
        # unbalanced parentheses; this assumes the same formula as the main value)
        window = Converted[istart:istop, jstart:jstop]
        Converted[istart:istop, jstart:jstop] = window + (window * (Change[1] / 100))
        # Set the main value to the new value
        Converted[i, j] = value + (value * (Change[0] / 100))
    # Convert back to 1D list
    return Converted.tolist()
a = [16,2,20,4,14,6,70,8,9,100,32,15,7,14,50,20,17,10,9,20,7,17,50,2,19,20,21,22,23,24,25]
samples = [2, 7]
grid_details = [10,50,100]
result = grid(a,samples,grid_details)
print(result)
PS: I want to avoid modifying any value in the grid that has previously been modified, be it the main value or the surrounding values.
First off, I'm not quite sure what you're asking, so forgive me if I've completely misunderstood your question...
You say that you only want to modify the first item that equals a given value, and not all of them. If so, you're going to need to add a break after you find the first value, otherwise you'll continue looping and modify all the other values.
However, there are better ways to do what you want.
Also, you're importing numpy at top and then never(?) using it...
This is exactly the sort of thing that you'd want to use numpy for, so I'm going to give an example of using it.
It appears that you're just applying a function to a 3x3 moving window of a 2D array, where the values of the array match some given value.
If we want to set the 3x3 area around a given index to some value, we'd just do something like this:
x[i-1:i+2, j-1:j+2] = value
...where x is your array, i and j are the row and column, and value is the value you want to set them to. (Similarly, x[i-1:i+2, j-1:j+2] returns the 3x3 array around <i,j>; slice stops are exclusive, hence i+2.)
Furthermore, if we want to know the <i,j> indices where a particular value occurs within an array, we can use numpy.argwhere, which will return a list of the <i,j> indices for each place where a given condition is true.
(Using conditionals on a numpy array results in a boolean array showing where the condition is true or false. So, x >= 10 will yield a boolean array of the same shape as x, not simply True or False. This lets you do nice things like x[x>100] = 10 to set all values in x that are above 100 to 10.)
To sum it all up, I believe this snippet does what you want to do:
import numpy as np
# First let's generate some data and set a few duplicate values
data = np.arange(100).reshape(10,10)
data[9,9] = 2
data[8,6] = 53
print 'Original Data:'
print data
# We want to replace the _first_ occurences of "samples" with the corresponding
# value in "grid_details" within a 3x3 window...
samples = [2, 53, 69]
grid_details = [200,500,100]
nrows, ncols = data.shape
for value, new_value in zip(samples, grid_details):
    # Notice that we're indexing the _first_ item that argwhere returns!
    i,j = np.argwhere(data == value)[0]
    # We need to make sure that we don't index outside the boundaries of the
    # array (which would cause a "wraparound" assignment)
    istart, istop = max(i-1, 0), min(i+2, nrows)
    jstart, jstop = max(j-1, 0), min(j+2, ncols)
    # Set the value within a 3x3 window to be "new_value"
    data[istart:istop, jstart:jstop] = new_value
print 'Modified Data:'
print data
This yields:
Original Data:
[[ 0 1 2 3 4 5 6 7 8 9]
[10 11 12 13 14 15 16 17 18 19]
[20 21 22 23 24 25 26 27 28 29]
[30 31 32 33 34 35 36 37 38 39]
[40 41 42 43 44 45 46 47 48 49]
[50 51 52 53 54 55 56 57 58 59]
[60 61 62 63 64 65 66 67 68 69]
[70 71 72 73 74 75 76 77 78 79]
[80 81 82 83 84 85 53 87 88 89]
[90 91 92 93 94 95 96 97 98 2]]
Modified Data:
[[ 0 200 200 200 4 5 6 7 8 9]
[ 10 200 200 200 14 15 16 17 18 19]
[ 20 21 22 23 24 25 26 27 28 29]
[ 30 31 32 33 34 35 36 37 38 39]
[ 40 41 500 500 500 45 46 47 48 49]
[ 50 51 500 500 500 55 56 57 100 100]
[ 60 61 500 500 500 65 66 67 100 100]
[ 70 71 72 73 74 75 76 77 100 100]
[ 80 81 82 83 84 85 53 87 88 89]
[ 90 91 92 93 94 95 96 97 98 2]]
Finally, you mentioned that you wanted to "view something as both an N-dimensional array and a "flat" list". This is in a sense what numpy arrays already are.
For example:
import numpy as np
x = np.arange(9)
y = x.reshape(3,3)
print x
print y
y[2,2] = 10000
print x
print y
Here, y is a "view" into x. If we change an element of y we change the corresponding element of x and vice versa.
Similarly, if we have a 2D array (or 3D, 4D, etc) that we want to view as a "flat" 1D array, you can just call flat_array = y.ravel() where y is your 2D array.
Hope that helps, at any rate!
You didn't specify that you had to do it any specific way so I'm assuming you're open to suggestions.
A completely different (and IMHO simpler) way would be to make an array of arrays:
grid = [[0,0,0,0,0],
        [0,0,0,2,0],
        [1,0,0,0,0],
        [0,0,0,0,0],
        [0,0,3,0,0]]
To access a location on the grid, simply supply the index of the list (the row), then the index of the location in that list (the column). For example:
grid[2][0]  # -> 1
grid[1][3]  # -> 2
grid[4][2]  # -> 3
To create a non-hardcoded grid (e.g. of a variable size):
def gridder(width, height):
    grid = []
    for i in range(0, height):
        # build a fresh row each time so the rows don't all share one list
        grid.append([1] * width)
    return grid
To modify a part of your grid:
def modifier(x, y, value):
    grid[y][x] = value
*If this is homework and you're supposed to do it the way specified in your question, then you probably can't use this answer.
