Python 3.1 - Grid simulation conceptual issue

The goal is to treat a 1D array as a 2D grid. A second 1D array gives a list of values that need to be changed in the grid, and a third array indicates by how much.
The catch is that the values surrounding a modified value are changed as well.
The example below stays as a 1D array, but makes calculations on it as if it were a 2D grid. It works, but currently it changes all values in the grid that match a value in the 1D list (samples). I want to convert only one value and its surroundings for each value in the list.
i.e. If the list is [2,3], I only want to change the first 2 and the first 3 that it comes across in the iteration. The example at the moment alters every 2 in the grid.
What is confusing me is that (probably because of the way I have structured the modifying calculations) I can't simply iterate through the grid and remove a list value each time it matches.
Thank you in advance for your time!!
The code follows:
import numpy

def grid_range(value):
    if value > 60000:
        value = 60000
        return (value)
    elif value < 100:
        value = 100
        return (value)
    elif value <= 60000 and value >= 100:
        return (value)

def grid(array, samples, details):
    original_length = len(array)
    c = int((original_length) ** 0.5)
    new_array = []  # create a new array with the modified values
    for elem in range(len(array)):  # if the value is in samples
        if array[elem] in samples:
            value = array[elem] + (array[elem] * (details[1] / 100))
            test_range = grid_range(value)
            new_array.append(test_range)
        elif ((elem + 1) < original_length) and array[elem - 1] in samples:  # change the one before the value
            if (len(new_array) % c == 0) and array[elem + 1] not in samples:
                new_array.append(array[elem])
            else:
                new_forward_element = array[elem] + (array[elem] * (details[2] / 100))
                test_range1 = grid_range(new_forward_element)
                new_array.append(test_range1)
        elif ((elem + 1) < original_length) and (array[elem + 1]) in samples:  # change the one before, making sure it doesn't attempt to modify past the end of the array
            if (len(new_array) + 1) % c == 0:
                new_array.append(array[elem])
            else:
                new_back_element = array[elem] + (array[elem] * (details[2] / 100))
                test_range2 = grid_range(new_back_element)
                new_array.append(test_range2)
        elif ((elem + c) <= (original_length - c)) and (array[elem + c]) in samples:  # based on the 9 numbers on the right of the keyboard with test value number 5, this is position '2'
            extra1 = array[elem] + (array[elem] * (details[2] / 100))
            test_range3 = grid_range(extra1)
            new_array.append(test_range3)
        elif (array[abs(elem - c)]) in samples:  # position '8'
            extra2 = array[elem] + (array[elem] * (details[2] / 100))
            test_range4 = grid_range(extra2)
            new_array.append(test_range4)
        elif (array[abs(elem - (c - 1))]) in samples:  # position '7'
            if (elem - (c - 1)) % c == 0:
                new_array.append(array[elem])
            else:
                extra3 = array[elem] + (array[elem] * (details[2] / 100))
                test_range5 = grid_range(extra3)
                new_array.append(test_range5)
        elif (array[abs(elem - (c + 1))]) in samples:  # position '9'
            if (elem - (c + 1) + 1) % c == 0:
                new_array.append(array[elem])
            else:
                extra4 = array[elem] + (array[elem] * (details[2] / 100))
                test_range6 = grid_range(extra4)
                new_array.append(test_range6)
        elif ((elem + (c - 1)) < original_length) and (array[elem + (c - 1)]) in samples:  # position '1', also not past total array length
            if (elem + (c - 1) + 1) % c == 0:
                new_array.append(array[elem])
            else:
                extra5 = array[elem] + (array[elem] * (details[2] / 100))
                test_range7 = grid_range(extra5)
                new_array.append(test_range7)
        elif (elem + (c + 1)) < (len(array) - c) and (array[elem + (c + 1)]) in samples:  # position '3', also not past total array length
            if (elem + (c + 1)) % c == 0:
                new_array.append(array[elem])
            else:
                extra6 = array[elem] + (array[elem] * (details[2] / 100))
                test_range8 = grid_range(extra6)
                new_array.append(test_range8)
        else:
            new_array.append(array[elem])
    return (new_array)
a = [16,2,20,4,14,6,70,8,9,100,32,15,7,14,50,20,17,10,9,20,7,17,50,2,19,20]
samples = [2]
grid_details = [10,50,100]
result = grid(a,samples,grid_details)
EDIT:
Based on your answer Joe, I have created a version which modifies the main value (centre) by a specific % and the surrounding elements by another. However, how do I ensure that the changed values are not converted again during the next iteration over samples?
Thank you for your time!
Example code:
import numpy as np

def grid(array, samples, details):
    # Sides of the square (will be using a squarable number)
    Width = int(len(array) ** 0.5)
    # Convert to grid
    Converted = np.array(array).reshape(Width, Width)
    # Conversion details
    Change = [details[1]] + [details[2]]
    nrows, ncols = Converted.shape
    for value in samples:
        # First instance where indexing returns it
        i, j = np.argwhere(Converted == value)[0]
        # Prevent indexing outside the boundaries of the
        # array, which would cause a "wraparound" assignment
        istart, istop = max(i - 1, 0), min(i + 2, nrows)
        jstart, jstop = max(j - 1, 0), min(j + 2, ncols)
        # Set the values within the 3x3 window to their "new value"
        window = Converted[istart:istop, jstart:jstop]
        Converted[istart:istop, jstart:jstop] = window + (window * (Change[1] / 100))
        # Set the main value to the new value
        Converted[i, j] = value + (value * (Change[0] / 100))
    # Convert back to a 1D list
    return Converted.ravel().tolist()
a = [16,2,20,4,14,6,70,8,9,100,32,15,7,14,50,20,17,10,9,20,7,17,50,2,19,20,21,22,23,24,25]
samples = [2, 7]
grid_details = [10,50,100]
result = grid(a,samples,grid_details)
print(result)
PS: I want to avoid modifying any value in the grid which has previously been modified, be it the main value or one of the surrounding values.

First off, I'm not quite sure what you're asking, so forgive me if I've completely misunderstood your question...
You say that you only want to modify the first item that equals a given value, and not all of them. If so, you're going to need to add a break after you find the first value, otherwise you'll continue looping and modify all the other values.
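To make that concrete, here's a minimal sketch (with made-up data, not your grid code) of how a break limits the change to the first match only:
grid_values = [16, 2, 20, 2, 14]   # illustrative data
target, new_value = 2, 99
for idx, item in enumerate(grid_values):
    if item == target:
        grid_values[idx] = new_value
        break  # stop here so later duplicates of target are left untouched
print(grid_values)  # [16, 99, 20, 2, 14] -- only the first 2 was changed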
However, there are better ways to do what you want.
Also, you're importing numpy at top and then never(?) using it...
This is exactly the sort of thing that you'd want to use numpy for, so I'm going to give an example of using it.
It appears that you're just applying a function to a 3x3 moving window of a 2D array, where the values of the array match some given value.
If we want to set the 3x3 area around a given index to some value, we'd just do something like this:
x[i-1:i+2, j-1:j+2] = value
...where x is your array, i and j are the row and column, and value is the value you want to set them to. (Similarly, x[i-1:i+2, j-1:j+2] returns the 3x3 array around <i,j>.)
Furthermore, if we want to know the <i,j> indices where a particular value occurs within an array, we can use numpy.argwhere, which will return a list of the <i,j> indices for each place where a given condition is true.
(Using conditionals on a numpy array results in a boolean array showing where the condition is true or false. So, x >= 10 will yield a boolean array of the same shape as x, not simply True or False. This lets you do nice things like x[x>100] = 10 to set all values in x that are above 100 to 10.)
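As a quick illustration of that boolean-indexing behaviour (a throwaway example, separate from the solution below):
import numpy as np

x = np.array([5, 150, 42, 300])
print(x > 100)    # [False  True False  True] -- a boolean array, not a single True/False
x[x > 100] = 10   # assign 10 wherever the condition is True
print(x)          # [ 5 10 42 10]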
To sum it all up, I believe this snippet does what you want to do:
import numpy as np

# First let's generate some data and set a few duplicate values
data = np.arange(100).reshape(10, 10)
data[9, 9] = 2
data[8, 6] = 53

print('Original Data:')
print(data)

# We want to replace the _first_ occurrences of "samples" with the corresponding
# value in "grid_details" within a 3x3 window...
samples = [2, 53, 69]
grid_details = [200, 500, 100]

nrows, ncols = data.shape
for value, new_value in zip(samples, grid_details):
    # Notice that we're indexing the _first_ item that argwhere returns!
    i, j = np.argwhere(data == value)[0]
    # We need to make sure that we don't index outside the boundaries of the
    # array (which would cause a "wraparound" assignment)
    istart, istop = max(i - 1, 0), min(i + 2, nrows)
    jstart, jstop = max(j - 1, 0), min(j + 2, ncols)
    # Set the value within a 3x3 window to be "new_value"
    data[istart:istop, jstart:jstop] = new_value

print('Modified Data:')
print(data)
This yields:
Original Data:
[[ 0 1 2 3 4 5 6 7 8 9]
[10 11 12 13 14 15 16 17 18 19]
[20 21 22 23 24 25 26 27 28 29]
[30 31 32 33 34 35 36 37 38 39]
[40 41 42 43 44 45 46 47 48 49]
[50 51 52 53 54 55 56 57 58 59]
[60 61 62 63 64 65 66 67 68 69]
[70 71 72 73 74 75 76 77 78 79]
[80 81 82 83 84 85 50 87 88 89]
[90 91 92 93 94 95 96 97 98 2]]
Modified Data:
[[ 0 200 200 200 4 5 6 7 8 9]
[ 10 200 200 200 14 15 16 17 18 19]
[ 20 21 22 23 24 25 26 27 28 29]
[ 30 31 32 33 34 35 36 37 38 39]
[ 40 41 500 500 500 45 46 47 48 49]
[ 50 51 500 500 500 55 56 57 100 100]
[ 60 61 500 500 500 65 66 67 100 100]
[ 70 71 72 73 74 75 76 77 100 100]
[ 80 81 82 83 84 85 50 87 88 89]
[ 90 91 92 93 94 95 96 97 98 2]]
Finally, you mentioned that you wanted to "view something as both an N-dimensional array and a "flat" list". This is in a sense what numpy arrays already are.
For example:
import numpy as np

x = np.arange(9)
y = x.reshape(3, 3)

print(x)
print(y)

y[2, 2] = 10000

print(x)
print(y)
Here, y is a "view" into x. If we change an element of y we change the corresponding element of x and vice versa.
Similarly, if we have a 2D array (or 3D, 4D, etc) that we want to view as a "flat" 1D array, you can just call flat_array = y.ravel() where y is your 2D array.
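For instance, a small sketch of that flat view (ravel usually returns a view for contiguous arrays, though it can return a copy otherwise):
import numpy as np

y = np.arange(9).reshape(3, 3)
flat_array = y.ravel()   # normally a view onto the same memory
flat_array[0] = -1       # modifying the flat view...
print(y[0, 0])           # ...shows up in the 2D array: prints -1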
Hope that helps, at any rate!

You didn't specify that you had to do it any specific way so I'm assuming you're open to suggestions.
A completely different (and IMHO simpler) way would be to make an array of arrays:
grid = [[0,0,0,0,0],
[0,0,0,2,0],
[1,0,0,0,0],
[0,0,0,0,0],
[0,0,3,0,0]]
To access a location on the grid, simply supply the index of the list (the row), then the index of the location in that list (the column). For example:
grid[2][0]  # 1
grid[1][3]  # 2
grid[4][2]  # 3
To create a non-hardcoded grid (e.g. of a variable size):
def gridder(width, height):
    rows = []
    for i in range(0, height):
        sublist = []
        for j in range(0, width):
            sublist.append(1)
        rows.append(sublist)  # build a fresh sublist per row so rows don't share a reference
    return rows
To modify a part of your grid:
def modifier(x, y, value):
    grid[y][x] = value
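A quick usage sketch of the helpers above (assuming the grid literal defined earlier is in scope):
modifier(2, 4, 99)    # set column 2, row 4
print(grid[4][2])     # 99
print(gridder(3, 2))  # [[1, 1, 1], [1, 1, 1]]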
*If this is homework and you're supposed to do it the way specified in your question, then you probably can't use this answer.

Related

Using np.vectorize to create a column in a data frame

I have a data frame that contains two columns with numbers and a third column with repeating letters. Let's say something like this:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0,100,size=(100, 2)), columns=list('xy'))
letters = ['A', 'B', 'C', 'D'] * int(len(df.index) / 4)
df['letters'] = letters
I want to create two new columns, which compare the numbers in columns 'x' and 'y' to the average of their corresponding letters. For example, one new column will just contain the number 10 (if the value is 20% or more above the mean), -10 (if it is 20% or more below the mean), or else 0.
I wrote the function below:
def scoreFunHigh(dataField, mean, diff, multip):
    upper = mean * (1 + diff)
    lower = mean * (1 - diff)
    if dataField > upper:
        return multip * 1
    elif dataField < lower:
        return multip * (-1)
    else:
        return 0
And then created the column as follows:
letterMeanX = df.groupby('letters')['x'].transform(np.nanmean)
df['letter x score'] = np.vectorize(scoreFunHigh)(df['x'], letterMeanX, 0.2, 10)
letterMeanY = df.groupby('letters')['y'].transform(np.nanmean)
df['letter y score'] = np.vectorize(scoreFunHigh)(df['y'], letterMeanY, 0.3, 5)
This works. However, I am getting the below runtime warning:
C:\Users\ ... \Python\Python38\lib\site-packages\numpy\lib\function_base.py:2167: RuntimeWarning: invalid value encountered in ? (vectorized)
outputs = ufunc(*inputs)
(Please note that if I run the exact same code as above, I do not get the warning. My real dataframe is much larger and I am using several functions for different data.)
What is the problem here? Is there a better way to set this up?
Thank you very much
The sample you give does not produce the RuntimeWarning, so we can't do anything to help you diagnose it. I don't know if a fuller traceback provides any useful information.
But let's look at the calculations:
In [70]: np.vectorize(scoreFunHigh)(df['x'], letterMeanX, 0.2, 10)
Out[70]:
array([-10, 0, 10, -10, 0, 0, -10, -10, 10, 0, 0, 10, -10,
-10, 0, 10, 10, -10, 0, 10, -10, -10, -10, 10, 10, -10,
...
-10, 10, -10, 0, 0, 10, 10, 0, 10])
and with the df assignment:
In [74]: df['letter x score'] = np.vectorize(scoreFunHigh)(df['x'], letterMeanX,
...: 0.2, 10)
...:
In [75]: df
Out[75]:
x y letters letter x score
0 33 98 A -10
1 38 49 B 0
2 78 46 C 10
3 31 46 D -10
4 41 74 A 0
.. .. .. ... ...
95 51 4 D 0
96 70 4 A 10
97 74 74 B 10
98 54 70 C 0
99 87 44 D 10
Often np.vectorize gives problems because of the otypes issue (read the docs); if the trial calculation produces an integer, then the return dtype is set to that, giving problems if other values are floats. However in this case the result can only have one of three values, [-10,0,10] (the last parameter).
The warning, such as you provide, suggests that some value(s) in the larger dataframe are wrong for the calculations in your scoreFunHigh function. But the warning doesn't give enough detail to say what.
It is relatively easy to apply real numpy vectorization to this problem, since it depends on two Series, df['x'] and letterMeanX, and 2 scalars.
In [111]: letterMeanX = df.groupby('letters')['x'].transform(np.nanmean)
In [112]: letterMeanX.shape
Out[112]: (100,)
In [113]: df['x'].shape
Out[113]: (100,)
In [114]: upper = letterMeanX *(1+0.2)
In [115]: lower = letterMeanX *(1-0.2)
In [116]: res = np.zeros(letterMeanX.shape,int)
In [117]: res[df['x']>upper] = 10
In [118]: res[df['x']<lower] = -10
In [119]: np.allclose(res, Out[70])
Out[119]: True
In other words, rather than applying the upper/lower comparison row by row, it applies it to the whole Series. It is still iterating, but in compiled numpy methods, which are much faster. np.vectorize is just a wrapper around an iteration. It still calls your python function once for each row. Hopefully the performance disclaimer is clear enough.
Consider directly calling your function, with a slight adjustment to handle the conditional logic using numpy.select (or numpy.where). With this approach no Python-level loops are run, only vectorized operations on the Series and scalar parameters:
def scoreFunHigh(dataField, mean, diff, multip):
    conds = [dataField > mean * (1 + diff),
             dataField < mean * (1 - diff)]
    vals = [multip * 1, multip * (-1)]
    return np.select(conds, vals, default=0)
letterMeanX = df.groupby('letters')['x'].transform(np.nanmean)
df['letter x score'] = scoreFunHigh(df['x'], letterMeanX, 0.2, 10)
letterMeanY = df.groupby('letters')['y'].transform(np.nanmean)
df['letter y score'] = scoreFunHigh(df['y'], letterMeanY, 0.3, 5)
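For completeness, a rough equivalent of the same logic written with nested numpy.where (just a sketch; the function name is illustrative, and it reuses df and letterMeanX from above):
import numpy as np

def scoreFunHighWhere(dataField, mean, diff, multip):
    # same conditional logic as scoreFunHigh, expressed with nested np.where
    return np.where(dataField > mean * (1 + diff), multip,
                    np.where(dataField < mean * (1 - diff), -multip, 0))

df['letter x score'] = scoreFunHighWhere(df['x'], letterMeanX, 0.2, 10)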
Here is a version that doesn't use np.vectorize:
def scoreFunHigh(val, mean, diff, multip):
    upper = mean * (1 + diff)
    lower = mean * (1 - diff)
    if val > upper:
        return multip * 1
    elif val < lower:
        return multip * (-1)
    else:
        return 0

letterMeanX = df.groupby('letters')['x'].apply(lambda x: np.nanmean(x))
df['letter x score'] = df.apply(
    lambda row: scoreFunHigh(row['x'], letterMeanX[row['letters']], 0.2, 10), axis=1)
Output
x y letters letter x score
0 52 76 A 0
1 90 99 B 10
2 87 43 C 10
3 44 73 D 0
4 49 3 A 0
.. .. .. ... ...
95 16 51 D -10
96 38 3 A 0
97 43 47 B 0
98 58 39 C 0
99 41 26 D 0

How to join datasets on coordinates?

I am analysing datasets and I need to compare them. The two datasets each have an index and coordinates (X, Y). The coordinates are not equal, so I need to use something like the numpy.isclose function (e.g. atol=5).
My aim in the comparison is to find similar y coordinates (e.g. y[5] = 101 (Dataset1), y2[7] = 103 (Dataset2)), and I need to compare the x coordinates of the same indices (e.g. x[5] = 405 (Dataset1), x2[7] = 401 (Dataset2)).
My problem is that I can't combine these two isclose calls.
I have tried comparing the y coordinates first and the x coordinates afterwards. As separate comparisons the function finds other data as well (e.g. y[5] = 101, y2[7] = 103; x[5] = 405, x2[3] = 402). It needs to compare the same indices (5/5 and 7/7).
This is working but gives wrong results:
yres = {i for i in yvar if numpy.isclose(yvar2, i, atol= 5).any()}
xres = {i for i in xvar if numpy.isclose(xvar2, i, atol= 5).any()}
Theoretically I am searching for something like this:
yres = {i for i in yvar if numpy.isclose(yvar2, i, atol= 5).any() & i for i in xvar if numpy.isclose(xvar2, i, atol= 5).any()}
I expect to find points with similar coordinates
(e.g. y[5] = 101, y2[7] = 103; x[5] = 405, x2[7] = 401).
At the moment I receive any similar data
(e.g. y[5] = 101, y2[7] = 103; x[5] = 405, x2[3] = 402).
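For what it's worth, a hedged sketch of combining the two conditions element-wise with broadcasting, so that both x and y closeness are required for the same index pair (the arrays below are made up; this is an illustration, not tested against the real data):
import numpy as np

# Stand-in coordinate arrays for dataset 1 and dataset 2
yvar = np.array([101, 250])
xvar = np.array([405, 300])
yvar2 = np.array([900, 103])
xvar2 = np.array([800, 401])

# Broadcast every point of dataset 1 against every point of dataset 2 and
# require BOTH coordinates of the SAME (i, j) pair to be close.
close = (np.isclose(yvar[:, None], yvar2[None, :], atol=5) &
         np.isclose(xvar[:, None], xvar2[None, :], atol=5))
pairs = np.argwhere(close)  # rows of (index in dataset 1, index in dataset 2)
print(pairs)                # [[0 1]] -> point 0 of set 1 matches point 1 of set 2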
Below is an input example (Picture 1 and Picture 2):
Pict1
Pict2
In this picture I need to identify 4 point pairs (Index pict1 / Index pict2):
6 / 9
7 / 8
17 / 13
20 / 14
Foreword
Your question is related to Nearest Neighbors Search (NNS).
One way to solve it is to build a spatial index, as in spatial databases.
A straightforward solution is a KD-Tree, which is implemented in sklearn.
Questions
At this point it is essential to know what question we want to answer:
Q1.a) Find all points in dataset B which are within a given distance threshold atol (radius) of the points of A.
Or:
Q2.a) Find the k closest points in dataset B with respect to each point of dataset A.
Both questions can be answered using a KD-Tree; what we must realise is:
Questions Q1 and Q2 are different, and so are their answers;
Q1 can map 0 or more points together; there is no guarantee of a one-to-one mapping;
Q2 will map exactly 1 to k points; there is a guarantee that all points in the reference dataset are mapped to k points in the search dataset (provided there are enough points);
Q2.a is generally not equivalent to its reciprocal question Q2.b (when datasets A and B are permuted).
MCVE
Let's build an MCVE to address both questions:
# Imports needed for this MCVE
import numpy as np
import pandas as pd
from sklearn.neighbors import KDTree

# Parameters
N = 50
atol = 50
keys = ['x', 'y']

# Trial Datasets (with different sizes, to keep it general):
df1 = pd.DataFrame(np.random.randint(0, 500, size=(N-5, 2)), columns=keys).reset_index()
df2 = pd.DataFrame(np.random.randint(0, 500, size=(N+5, 2)), columns=keys).reset_index()

# Spatial Index for Datasets:
kdt1 = KDTree(df1[keys].values, leaf_size=5, metric='euclidean')
kdt2 = KDTree(df2[keys].values, leaf_size=5, metric='euclidean')

# Answer Q2.a and Q2.b (searching for a single neighbour):
df1['kNN'] = kdt2.query(df1[keys].values, k=1, return_distance=False)[:, 0]
df2['kNN'] = kdt1.query(df2[keys].values, k=1, return_distance=False)[:, 0]

# Answer Q1.a and Q1.b (searching within a radius):
df1['radius'] = kdt2.query_radius(df1[keys].values, atol)
df2['radius'] = kdt1.query_radius(df2[keys].values, atol)
Below is the result for Dataset A as reference:
index x y kNN radius
0 0 65 234 39 [39]
1 1 498 49 11 [11]
2 2 56 171 19 [29, 19]
3 3 239 43 20 [20]
4 4 347 32 50 [50]
[...]
At this point, we have everything required to spatially join our data.
Nearest Neighbors (k=1)
We can join our datasets using kNN index:
kNN1 = df1.merge(df2[['index'] + keys], left_on='kNN', right_on='index', suffixes=('_a', '_b'))
It returns:
index_a x_a y_a kNN radius index_b x_b y_b
0 0 65 234 39 [39] 39 49 260
1 1 498 49 11 [11] 11 487 4
2 2 56 171 19 [29, 19] 19 39 186
3 3 239 43 20 [20] 20 195 33
4 4 347 32 50 [50] 50 382 32
[...]
Graphically it leads to:
And reciprocal question is about:
We see that the mapping is exactly 1-to-k=1: all points in the reference dataset are mapped to a point in the search dataset. But the answers differ when we swap the reference.
Radius atol
We can also join our datasets using the radius index:
rad1 = df1.explode('radius')\
          .merge(df2[['index'] + keys], left_on='radius', right_on='index',
                 suffixes=('_a', '_b'))
It returns:
index_a x_a y_a kNN radius index_b x_b y_b
0 0 65 234 39 39 39 49 260
2 1 498 49 11 11 11 487 4
3 2 56 171 19 29 29 86 167
4 2 56 171 19 19 19 39 186
7 3 239 43 20 20 20 195 33
[...]
Graphically:
Reciprocal answer is equivalent:
We see the answers are identical, but there is no guarantee of a one-to-one mapping. Some points are not mapped (lonely points), some are mapped to many points (dense neighbourhood). Additionally, it requires an extra parameter atol which must be tuned for a given context.
Bonus
Below is the function used to render the figures:
import matplotlib.pyplot as plt

def plot(A, B, join, title=''):
    X = join.loc[:, ['x_a', 'x_b']].values
    Y = join.loc[:, ['y_a', 'y_b']].values
    fig, axe = plt.subplots()
    axe.plot(A['x'], A['y'], 'x', label='Dataset A')
    axe.plot(B['x'], B['y'], 'x', label='Dataset B')
    for k in range(X.shape[0]):
        axe.plot(X[k, :], Y[k, :], linewidth=0.75, color='black')
    axe.set_title(title)
    axe.set_xlabel(r'$x$')
    axe.set_ylabel(r'$y$')
    axe.grid()
    axe.legend(bbox_to_anchor=(1, 1), loc='upper left')
    return axe
References
Some useful references:
https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.random.randint.html
https://en.wikipedia.org/wiki/Nearest_neighbor_search
https://en.wikipedia.org/wiki/K-d_tree
https://scikit-learn.org/stable/modules/neighbors.html
https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KDTree.html#sklearn.neighbors.KDTree
https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KDTree.html#sklearn.neighbors.KDTree.query_radius

for loop in Python, changing the list it is looping over

cleanedList = [x for x in range(0, 100, 1)]
idx = 0
for val in cleanedList:
    check = abs(cleanedList[idx])
    idx = idx + 1
    if check % 5 == 0:  ##### Conditions changed; change the list
        cleanedList = a_new_list_to_loop_over  # (pseudocode: assign a new list here)
This is an arbitrary example. I want to change the list it is looping over when the condition fails. I tried it this way, but I don't think it actually changed the list being looped over. Please correct me.
It is not advisable to change the list over which you are looping. However, if this is what you really want, then you could do it this way:
cleanedList = list(range(0, 100, 1))
for i, _ in enumerate(cleanedList):
    check = abs(cleanedList[i])
    if check % 5 == 0:  ##### Change the list it is looping over now
        cleanedList[:] = range(60, 100, 2)
This is an interesting one, because you haven't actually mutated the list.
cleanedList = [x for x in range(0, 100, 1)]  # Creates list1
idx = 0
for val in cleanedList:  # begin iterating list1. It's stored internally here.
    check = abs(cleanedList[idx])
    print val, check,
    idx = idx + 1
    if check < 30:  ##### Change the list it is looping now
        cleanedList = [x for x in range(60, 100, 2)]  # reassign here, but it becomes list2.
The output tells the story:
0 0 1 62 2 64 3 66 4 68 5 70 6 72 7 74 8 76 9 78 10 80 11 82 12 84 13 86 14 88 15 90 16 92 17 94 18 96 19 98
Because you reassigned rather than mutated, the reference to the list you started iterating over still exists for the duration of the for loop, and the loop continues way past the end of list 2, which is why you eventually get an IndexError - there are 100 items in your first list, and only 20 in your second list.
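A tiny sketch of the difference between rebinding the name and mutating the list in place (illustrative only):
a = [1, 2, 3]
b = a          # b refers to the same list object
a = [9, 9]     # rebinding: a now points to a NEW list, b still sees the old one
print(b)       # [1, 2, 3]

a = [1, 2, 3]
b = a
a[:] = [9, 9]  # slice assignment mutates the SAME object, so b sees the change
print(b)       # [9, 9]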
Very briefly, when you want to edit a list you're iterating over, you should use a copy of the list. So your code simply becomes:
for val in cleanedList[:]:
and you can make all kinds of edits to your original cleanedList and no error will show up.
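For instance, a minimal sketch of editing the original while iterating over a copy (made-up example):
cleanedList = list(range(10))
for val in cleanedList[:]:       # iterate over a snapshot copy
    if val % 3 == 0:
        cleanedList.remove(val)  # safe: we mutate the original, not the copy
print(cleanedList)               # [1, 2, 4, 5, 7, 8]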

How to check correlation between matching columns of two data sets?

If we have the data set:
import pandas as pd
a = pd.DataFrame({"A":[34,12,78,84,26], "B":[54,87,35,25,82], "C":[56,78,0,14,13], "D":[0,23,72,56,14], "E":[78,12,31,0,34]})
b = pd.DataFrame({"A":[45,24,65,65,65], "B":[45,87,65,52,12], "C":[98,52,32,32,12], "D":[0,23,1,365,53], "E":[24,12,65,3,65]})
How does one create a correlation matrix, in which the y-axis represents "a" and the x-axis represents "b"?
The aim is to see correlations between the matching columns of the two datasets like this:
If you don't mind a NumPy-based vectorized solution, here is one based on this solution post to Computing the correlation coefficient between two multi-dimensional arrays:
corr2_coeff(a.values.T,b.values.T).T # func from linked solution post.
Sample run -
In [621]: a
Out[621]:
A B C D E
0 34 54 56 0 78
1 12 87 78 23 12
2 78 35 0 72 31
3 84 25 14 56 0
4 26 82 13 14 34
In [622]: b
Out[622]:
A B C D E
0 45 45 98 0 24
1 24 87 52 23 12
2 65 65 32 1 65
3 65 52 32 365 3
4 65 12 12 53 65
In [623]: corr2_coeff(a.values.T,b.values.T).T
Out[623]:
array([[ 0.71318502, -0.5923714 , -0.9704441 , 0.48775228, -0.07401011],
[ 0.0306753 , -0.0705457 , 0.48801177, 0.34685977, -0.33942737],
[-0.26626431, -0.01983468, 0.66110713, -0.50872017, 0.68350413],
[ 0.58095645, -0.55231196, -0.32053858, 0.38416478, -0.62403866],
[ 0.01652716, 0.14000468, -0.58238879, 0.12936016, 0.28602349]])
This achieves exactly what you want:
from scipy.stats import pearsonr

# create a new DataFrame where the values for the indices and columns
# align on the diagonals
c = pd.DataFrame(columns=a.columns, index=a.columns)

# since we know set(a.columns) == set(b.columns), we can just iterate
# through the columns in a (although a more robust way would be to iterate
# through the intersection of the two sets of columns, in case your actual
# dataframes' columns don't match up)
for col in a.columns:
    correl_signif = pearsonr(a[col], b[col])  # correlation of those two Series
    correl = correl_signif[0]                 # grab the actual Pearson R value from the tuple above
    c.loc[col, col] = correl                  # locate the diagonal for that column and assign the correlation coefficient
Edit: Well, it achieved exactly what you wanted, until the question was modified. Although this can easily be changed:
c = pd.DataFrame(columns=a.columns, index=a.columns)
for col in c.columns:
    for idx in c.index:
        correl_signif = pearsonr(a[col], b[idx])
        correl = correl_signif[0]
        c.loc[idx, col] = correl
c is now this:
Out[16]:
A B C D E
A 0.713185 -0.592371 -0.970444 0.487752 -0.0740101
B 0.0306753 -0.0705457 0.488012 0.34686 -0.339427
C -0.266264 -0.0198347 0.661107 -0.50872 0.683504
D 0.580956 -0.552312 -0.320539 0.384165 -0.624039
E 0.0165272 0.140005 -0.582389 0.12936 0.286023
I use this function that breaks it down with numpy
def corr_ab(a, b):
    a_ = a.values
    b_ = b.values
    ab = a_.T.dot(b_)
    n = len(a)
    sums_squared = np.outer(a_.sum(0), b_.sum(0))
    stds_squared = np.outer(a_.std(0), b_.std(0))
    return pd.DataFrame((ab - sums_squared / n) / stds_squared / n,
                        a.columns, b.columns)
demo
corr_ab(a, b)
Do you have to use Pandas? This seems like it can be done via numpy rather easily. Did I understand the task incorrectly?
import numpy
X = {"A":[34,12,78,84,26], "B":[54,87,35,25,82], "C":[56,78,0,14,13], "D":[0,23,72,56,14], "E":[78,12,31,0,34]}
Y = {"A":[45,24,65,65,65], "B":[45,87,65,52,12], "C":[98,52,32,32,12], "D":[0,23,1,365,53], "E":[24,12,65,3,65]}
for key, value in X.items():
    print "correlation stats for %s is %s" % (key, numpy.corrcoef(value, Y[key]))

Finding row-wise averages and the average of all rows of a space-delimited file

I have a file which has many lines and each row would look like below:
10 55 19 51 2 9 96 64 60 2 45 39 99 60 34 100 33 71 49 13
77 3 32 100 68 90 44 100 10 52 96 95 36 50 96 39 81 25 26 13
Each line has numbers separated by spaces, and each line (row) is of a different length.
How can I find the average of each row?
How can I find Sum of all the row wise averages?
Preferred language Python
The code below does the task mentioned:
def rowAverageSum(filename):
    import numpy as np
    FullMean = 0
    li = [map(int, x) for x in [i.strip().split() for i in open(filename).readlines()]]
    i = 0
    while i < len(li):
        for k in li:
            print "Mean of row ", i+1, ":", np.mean(k)
            FullMean += np.mean(k)
            i += 1
    print "***************************"
    print "Grand Average:", FullMean
    print "***************************"
Using two utility functions, words (to get the words in a line) and average (to get the average of a sequence of integers), I'd start with something like
def words(s):
    return (w for w in s.strip().split())

def average(l):
    return sum(l) / len(l)

with open('input.txt') as f:
    averages = [average(map(int, words(line))) for line in f]

total = sum(averages)
I like the total = sum(averages) part which very closely resembles your second requirement (the sum of all averages). :-)
I used map(int, words(line)) (to convert a list of strings to a list of integers) simply because it's shorter than [int(x) for x in words(line)] even though the latter would most certainly be considered to be "more Pythonic".
How about trying this in a short way?
avg_per_row = []
avg_all_row = 0
f1 = open("myfile")  # Default mode is read
for line in f1:
    temp = line.split()
    avg = sum([int(x) for x in temp]) / len(temp)
    avg_per_row.append(avg)  # Average per row
avg_all_row = sum(avg_per_row) / len(avg_per_row)  # Average of all row averages
Very compressed, but should work for you
3 / 2 is 1 in Python 2, so if you want a float result you should convert to float:
float(3) / 2 is 1.5
>>> s = '''10 55 19 51 2 9 96 64 60 2 45 39 99 60 34 100 33 71 49 13
77 3 32 100 68 90 44 100 10 52 96 95 36 50 96 39 81 25 26 13'''
>>> line_averages = []
>>> for line in s.splitlines():
... line_averages.append(sum([ float(ix) for ix in line.split()]) / len(line.split()))
...
>>> line_averages
[45.55, 56.65]
>>> sum(line_averages)
102.19999999999999
Or you can use reduce
>>> for line in s.splitlines():
... line_averages.append(reduce(lambda x,y: int(x) + int(y), line.split()) / len(line.split()))
>>> line_averages
[45, 56]
>>> reduce(lambda x,y: int(x) + int(y), line_averages)
101
>>> f = open('yourfile')
>>> averages = [ sum(map(float,x.strip().split()))/len(x.strip().split()) for x in f ]
>>> averages
[45.55, 56.65]
>>> sum(averages)
102.19999999999999
>>> sum(averages)/len(averages)
51.099999999999994
strip removes the '\n', then split splits on whitespace and gives a list of the numbers; map converts all the numbers to float, and sum adds them all up.
If you don't understand the above code, you can look at this; it's the same as above but expanded:
>>> f = open('ll.txt')
>>> averages = []
>>> for x in f:
... x = x.strip() # removes newline character
... x = x.split() # split the lines on whitespaces and produces list of numbers
... x = [ float(i) for i in x ] # convert all number to type float
... avg = sum(x)/len(x) # calculate average ans store to variable avg
... averages.append(avg) # append the avg to list averages
...
>>> averages
[45.55, 56.65]
>>> sum(averages)/len(averages)
51.099999999999994
