Python scatter - change marker style based on entity - python

I've been struggling for days to solve this problem: I have Cartesian coordinates on the y-axis (depth, from 0 to 1) and values on the x-axis (the firing rates of different cell populations at the given depth, so they vary randomly).
I would like the markers in the scatter plot to be larger where the x-axis value (firing rate) is larger.
Thank you for any suggestion.
This is the code (not working).
import matplotlib.pyplot as plt
import numpy as np
import matplotlib.cbook as cbook
x = np.genfromtxt('x_dex.csv', delimiter=',')
y = np.genfromtxt('z_dex.csv', delimiter=',')
array = [i for i in x if i > 4]
array.sort()
s = [30*2**n for n in range(len(array))];
plt.subplot(212)
plt.scatter(x,y,s=s)
plt.show()
This is unfortunately not showing the correct relation between size of marker and depth.

The line where you compute your 'size' values looks incorrect to me:
s = [30*2**n for n in range(len(array))];
This will give you a list containing:
s = [30*2**0, 30*2**1, 30*2**2, ..., 30*2**(len(array) - 1)]
The values bear no relation to y, so I assume this is not what you intended. Maybe you meant something more like this:
s = 30 * 2 ** y
There are actually several other issues here:
Don't give your variables names like array - this can lead to confusion with numpy.array. It's even worse in this case, since array is actually not an array but a Python list!
Since you're dealing with numpy arrays, it's much faster to use vectorization rather than list comprehensions. For example, you could use:
array = x[x > 4]
rather than
array = [i for i in x if i > 4]
After your list comprehension array = [i for i in x if i > 4], array will have fewer elements than y if any elements of x are less than or equal to 4.
array.sort() will sort the list in place, which means that the order of the elements in array will no longer match the order of elements in y.
In fact, sorting seems rather pointless in this situation - since you're making a scatter plot the order of the points should not matter.
You're not writing MATLAB code any more, so there's no need to end lines with a semicolon (although it won't do any harm if you do).
Here's my educated guess at what you're trying to do:
import matplotlib.pyplot as plt
import numpy as np
x = np.genfromtxt('x_dex.csv', delimiter=',')
y = np.genfromtxt('z_dex.csv', delimiter=',')
# get the set of indices that will sort x in ascending order, apply these
# to both x & y
order = np.argsort(x)
x_sorted = x[order]
y_sorted = y[order]
# keep only xy pairs where x > 4
valid = x_sorted > 4
x_valid = x_sorted[valid]
y_valid = y_sorted[valid]
# compute the sizes
s = 30 * 2 ** y_valid
# plot
plt.subplot(212)
plt.scatter(x_valid, y_valid, s=s)
plt.show()
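If the marker size should instead follow the x value (the firing rate), as the question's wording suggests, one option is to rescale x linearly into a range of marker areas. This is only a sketch: the helper name marker_sizes, the size limits, and the sample data are made up for illustration and are not from the original post.

```python
import numpy as np

def marker_sizes(values, min_size=20.0, max_size=300.0):
    """Linearly rescale values into the range [min_size, max_size]."""
    values = np.asarray(values, dtype=float)
    span = values.max() - values.min()
    if span == 0:
        # all values identical: use a single mid-range size
        return np.full(values.shape, (min_size + max_size) / 2)
    scaled = (values - values.min()) / span  # now in [0, 1]
    return min_size + scaled * (max_size - min_size)

# made-up firing rates (x) and depths (y), for illustration only
x = np.array([4.5, 6.0, 12.0, 9.0])
y = np.array([0.1, 0.4, 0.7, 0.9])

sizes = marker_sizes(x)
print(sizes)
```

The result can be passed straight to the plot, e.g. plt.scatter(x, y, s=sizes). Note that s is the marker area in points squared, so a linear mapping like this makes the area, not the diameter, proportional to the value.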

Related

numpy: efficiently obtain a statistic over array elements grouped by the elements of another array

Apologies in advance for the potentially misleading title. I could not think of the way to properly word the problem without an illustrative example.
I have some data array (e.g.):
x = np.array([2,2,2,3,3,3,4,4,4,1,1,2,2,3,3])
and a corresponding array of equal length which indicates which elements of x are grouped:
y = np.array([0,0,0,0,0,0,0,0,0,1,1,1,1,1,1])
In this example, there are two groupings in x: [2,2,2,3,3,3,4,4,4] where y=0; and [1,1,2,2,3,3] where y=1. I want to obtain a statistic on all elements of x where y is 0, then 1. I would like this to be extendable to large arrays with many groupings. y is always ordered from lowest to highest AND is always sequentially increasing without any missing integers between the min and max. For example, y could be np.array([0,0,1,2,2,2,2,3,3,3]) for some x array of the same length, but not np.array([0,0,2,2,2,2,2,3,3,3]), as this has no ones.
I can do this by brute force quite easily for this example.
import numpy as np
x = np.array([2,2,2,3,3,3,4,4,4,1,1,2,2,3,3])
y = np.array([0,0,0,0,0,0,0,0,0,1,1,1,1,1,1])
y_max = np.max(y)
stat_min = np.zeros(y_max+1)
stat_sum = np.zeros(y_max+1)
for i in np.arange(y_max+1):
    stat_min[i] = np.min(x[y==i])
    stat_sum[i] = np.sum(x[y==i])
print(stat_min)
print(stat_sum)
Gives: [2. 1.] and [27. 12.] for the minimum and sum statistics for each grouping, respectively. I need a way to make this efficient for large numbers of groupings and where the arrays are very large (> 1 million elements).
EDIT
A bit better with list comprehension.
import numpy as np
x = np.array([2,2,2,3,3,3,4,4,4,1,1,2,2,3,3])
y = np.array([0,0,0,0,0,0,0,0,0,1,1,1,1,1,1])
y_max = np.max(y)
stat_min = np.array([np.min(x[y==i]) for i in range(y_max+1)])
stat_sum = np.array([np.sum(x[y==i]) for i in range(y_max+1)])
print(stat_min)
print(stat_sum)
You could put your arrays into a DataFrame, then use groupby and its aggregation methods: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html
import pandas as pd
df = pd.DataFrame({'x': x, 'y': y})
mins = df.groupby('y').min()
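Since the question states that y is sorted and has no gaps, a NumPy-only alternative is to find where each group starts and reduce over those segments with ufunc.reduceat. This is a sketch that relies on that sorted-and-contiguous assumption; it would give wrong groupings for unsorted y.

```python
import numpy as np

x = np.array([2, 2, 2, 3, 3, 3, 4, 4, 4, 1, 1, 2, 2, 3, 3])
y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1])

# Start index of each group: position 0 plus every position where y changes.
# This assumes y is sorted, as stated in the question.
starts = np.flatnonzero(np.r_[True, y[1:] != y[:-1]])

# reduceat applies the reduction to x[starts[i]:starts[i+1]] for each i
stat_min = np.minimum.reduceat(x, starts)
stat_sum = np.add.reduceat(x, starts)

print(stat_min)  # [2 1]
print(stat_sum)  # [27 12]
```

No Python-level loop over the groups is involved, so this should scale to arrays with millions of elements and many groups.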

Cutting a subset from middle of NumPy array

I work with raster images and use the rasterio module to import them as numpy arrays. I would like to cut a portion of size (1000, 1000) out of the middle of each (to avoid the out-of-bound masks of the image).
image = np.random.random_sample((2000, 2000))
s = image.shape
mid = [round(x / 2) for x in s] # middle point of both axes
margins = [[y + x for y in [-500, 500]] for x in mid] # 1000 range around every middle point
The result is a list of 2 lists, one cut range per axis. But this is where I'm stumped: range() doesn't accept lists, so I'm attempting the following brute-force method:
cut_image = image[range(margins[0][0], margins[0][1]), range(margins[1][0], margins[1][1])]
However:
cut_image.shape
## (1000,)
Slicing an array loses dimension information which is exactly what I don't want.
Consider me confused.
Looking for a more tasteful solution.
As the other answer points out, you're not really slicing your array, but using integer indexing on it.
If you want to slice your array (and you're right, that's more elegant than using lists of indices), you'll be happier with slice objects. A slice object represents the start:end:step syntax.
In your case,
import numpy as np
WND = 50
image = np.random.random_sample((200, 300))
s = image.shape
mid = [round(x / 2) for x in s] # middle point of both axes
margins = [[y + x for y in [-WND, WND]] for x in mid] # range of width 2*WND around every middle point
# array[slice(start, end)] -> array[start:end]
x_slice = slice(margins[0][0], margins[0][1])
y_slice = slice(margins[1][0], margins[1][1])
print(x_slice, y_slice)
# slice(50, 150, None) slice(100, 200, None)
cut_image = image[x_slice, y_slice]
print(cut_image.shape)
# (100, 100)
Indexing?
You might wonder why the code in your question returned only 1000 elements instead of the expected 1000*1000.
Here is a simpler example of indexing with lists on different dimensions:
# n and n2 have identical values
n = a[[i0, i1, i2],[j0, j1, j2]]
n2 = np.array([a[i0, j0], a[i1, j1], a[i2, j2]])
This being clarified, you'll understand that instead of taking a block matrix, your code only returns the diagonal coefficients of that block matrix :)
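To make the difference concrete, here is a small sketch on a 4x4 array (the array and index lists are made up for illustration). It compares paired integer indexing, np.ix_, and plain slicing.

```python
import numpy as np

a = np.arange(16).reshape(4, 4)
rows = [1, 2]
cols = [1, 2]

# Index lists are paired element by element: picks a[1, 1] and a[2, 2]
paired = a[rows, cols]

# np.ix_ builds an open mesh, so every row is combined with every column
block = a[np.ix_(rows, cols)]

# Basic slicing selects the same block and returns a view, not a copy
sliced = a[1:3, 1:3]

print(paired)  # [ 5 10]
print(block)   # [[ 5  6]
               #  [ 9 10]]
```

For a contiguous range, slicing is the simplest choice; np.ix_ is mainly useful when the wanted rows and columns are not contiguous.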
The issue here is that what you're doing is known as integer indexing, not slice indexing. The behaviour is different and may seem counterintuitive if you're not acquainted with it. You can check the docs for more details.
Here's how you could do it with basic slicing:
# center coordinates of the image
x_0, y_0 = np.asarray(image.shape)//2
# slice taken from the center point
out = image[x_0-x_0//2:x_0+x_0//2, y_0-y_0//2:y_0+y_0//2]
print(out.shape)
# (1000, 1000)
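The slice above always returns half of each dimension, which happens to be 1000 for a 2000x2000 image. For an arbitrary crop size, something like the following sketch could be used; center_crop is a hypothetical helper name, not a NumPy or rasterio function.

```python
import numpy as np

def center_crop(image, height, width):
    """Return the central (height, width) window of a 2-D (or 3-D) array."""
    rows, cols = image.shape[:2]
    if height > rows or width > cols:
        raise ValueError("crop size exceeds image size")
    top = (rows - height) // 2
    left = (cols - width) // 2
    # basic slicing keeps both dimensions and returns a view
    return image[top:top + height, left:left + width]

image = np.arange(200 * 300).reshape(200, 300)
cut = center_crop(image, 100, 100)
print(cut.shape)  # (100, 100)
```

Because the result is a view, modifying it also modifies the original image; call .copy() on it if that is not wanted.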

Python: Filtering numpy values based on certain columns

I'm trying to create a method for evaluating co-ordinates for a project that's due in about a week.
Assume that I'm working in a 3D Cartesian co-ordinate system whose points are stored as rows in a numpy array. I am trying to read the 'z' (n[i, 2]) values, if they exist, for given, predetermined 'x' (n[i, 0]) and 'y' (n[i, 1]) values.
In the case where the values that are assigned are scalars, I am content to think that:
# Given that n is some numpy array
x, y = 2,3
out = []
for i in range(0,n.shape[0]):
    if n[i, 0] == x and n[i,1] == y:
        out.append(n[i,2])
However, where the sorrow comes in is having to check if the values in another numpy array are in the original numpy array 'n'.
# Given that n is the numpy array that is to be searched
# Given that x contains the 'search elements'
out = []
for i in range(0,n.shape[0]):
    for j in range(0, x.shape[0]):
        if n[i, 0] == x[j,0] and n[i,1] == x[j,1]:
            out.append(n[i,2])
The issue with doing it this way is that the 'n' matrix in my application may well be in excess of 100 000 lines long.
Is there a more efficient way of performing this function?
This might be more efficient than nested loops:
out = []
for row in x:
    idx = np.equal(n[:,:2], row).all(1)
    out.extend(n[idx,2].tolist())
Note this assumes that x is of shape (?, 2). Otherwise, if it has more than two columns, just change row to row[:2] in the loop body.
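Numpythonic solution without loops.
This solution is intended for non-negative x and y coordinates. Be aware that it matches on the dot product, which is only a necessary condition: a row can give the same dot product without having the same coordinates (for example, the search pair (1, 2) also matches a row starting with (5, 0), since 1*5 + 2*0 = 1*1 + 2*2), so verify the matches if such rows can occur.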
import numpy as np
# Using a for x and b for n, to avoid confusion with x,y coordinates and array names
a = np.array([[1,2],[3,4]])
b = np.array([[1,2,10],[1,2,11],[3,4,12],[5,6,13],[3,4,14]])
# Adjust the shapes by taking the z coordinate as 0 in a and take the dot product with b transposed
a = np.insert(a,2,0,axis=1)
dot_product = np.dot(a,b.T)
# Reshape a**2 to check the dot product values corresponding to exact values in the x, y coordinates
sum_reshaped = np.sum(a**2,axis=1).reshape(a.shape[0],1)
# Match values for individual elements in a. Can be used if you want z coordinates corresponding to some x, y separately
individual_indices = ( dot_product == np.tile(sum_reshaped,b.shape[0]) )
# Take OR of column values and take z if at least one x,y present
indices = np.any(individual_indices, axis=0)
print(b[:,2][indices]) # prints [10 11 12 14]
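Another loop-free option is to compare the coordinate pairs directly with broadcasting, which tests exact equality of both columns. This is a sketch with made-up data, using the question's names n (array to search) and x (search pairs). It builds a temporary boolean array of shape (len(n), len(x), 2), so memory use grows with the product of the two lengths.

```python
import numpy as np

# n: rows of (x, y, z); x: the (x, y) pairs to look for
n = np.array([[1, 2, 10],
              [1, 2, 11],
              [3, 4, 12],
              [5, 6, 13],
              [3, 4, 14],
              [5, 0, 15]])  # (5, 0) is not a search pair, although 1*5 + 2*0 == 1*1 + 2*2
x = np.array([[1, 2],
              [3, 4]])

# Compare every row's (x, y) with every search pair: shape (len(n), len(x), 2)
pair_equal = n[:, None, :2] == x[None, :, :]

# A row matches a pair only if both coordinates are equal
matches = pair_equal.all(axis=2)   # shape (len(n), len(x))

# Keep rows that match at least one search pair
mask = matches.any(axis=1)
out = n[mask, 2]
print(out)  # [10 11 12 14]
```

If x is large as well, processing x in chunks or using the row-by-row loop shown earlier keeps the temporary array small.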

Python - dealing with numpy functions

I was given a question: I'm given 100000 sequences of 1000 coin tosses arranged in a matrix. To generate the data use the commands:
import numpy
data = numpy.random.binomial(1, 0.25, (100000,1000))
Now I need for the first 5 sequences of 1000 tosses (the first 5 rows in data) to plot (using pylab) the estimate Xm which is the sum of i from 1 to m. (Meaning the sum of all tosses up to m)
Now I was trying to do the following:
data = numpy.random.binomial(1, 0.25, (100000,1000))
x = numpy.linspace(1,1000,1000, int) # in order to create an array of 1000 ints between 1 and 1000
y = numpy.sum(data[0], x) # taking the first row
pylab.plot(x,y)
pylab.show()
And I'm getting an error
only length-1 arrays can be converted to Python scalars
furthermore, when trying to do
y = numpy.sum(data[0], tuple(x))
because I looked up the function and saw that axis needed to be a tuple of ints, I get an error
ValueError: too many values for 'axis'
So basically I'm a bit lost, could use some help.
Thanks!
I think you want to use numpy.cumsum. Furthermore, axis needs to be an integer, not an array or a tuple of an array (I think this explains the errors).
This should work:
import numpy as np
import pylab
data = np.random.binomial(1, 0.25, (100000,1000))
y = np.cumsum(data[:5, :], axis=1) # cumulative sum of first 5 rows along the rows (axis=1)
pylab.plot(np.arange(1,y.shape[1]+1), y.T)
pylab.show()
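A smaller sketch of the same idea, without plotting, shows what cumsum returns. The seed and the running-mean step are my assumptions: the question describes Xm as a sum of the first m tosses, but if the intended estimate is the average of the first m tosses, dividing by m gives it.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only to make the sketch repeatable
data = rng.binomial(1, 0.25, size=(5, 1000))

# running[i, m-1] is the number of heads in the first m tosses of row i
running = np.cumsum(data, axis=1)

# Assumed variant: the running mean of the first m tosses
m = np.arange(1, data.shape[1] + 1)
running_mean = running / m

print(running.shape)    # (5, 1000)
print(running[:, -1])   # total heads per row, same as data.sum(axis=1)
```

Either array can be plotted against m in the same way as above, transposing it so that each row becomes one line.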

Store indices of neighbouring cells which fall within a certain radius

I have a very large numpy array of 1s and 0s. I want to go row by row and look for all the 1s. Once I encounter a 1, I want to store the indices of entries which fall inside a radius of five rows. This is better illustrated in the picture:
(in the picture I only show half a circle, in the real case I need the indices of the values that fall inside the entire circle)
Once I collect the indices, I go to the next 1 in the array and do the same. Once I finish looping through the array I want to set all the values of the collected indices which are not 1 to 1. In a sense, I am creating a buffer around all 1s with a radius of 5 columns.
for row in myarray:
    for column in myarray:
        dist = math.sqrt(row**2+column**2)
        if dist <= 5:
            # ... store the indices of the neighbouring cells
Can you please give me a suggestion how to accomplish this?
The operation you are describing is called dilation. If you have scipy, you could use ndimage.binary_dilation to obtain the result:
import numpy as np
import scipy.ndimage as ndimage
import matplotlib.pyplot as plt
arr = np.zeros((21, 21))
arr[5, 5] = arr[15, 15] = 1
i, j = np.ogrid[:11, :11]
# struct = ((i-5)**2 + (j-5)**2 <= 40)
struct = np.abs(i-5)+ np.abs(j-5) <= 8
result = ndimage.binary_dilation(arr, structure=struct)
plt.imshow(result, interpolation='nearest')
plt.show()
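To make explicit what a dilation with a circular neighbourhood of radius 5 does, here is a NumPy-only sketch. It is an illustration of the idea, not how scipy implements binary_dilation, and buffer_ones is a hypothetical helper name. It uses Euclidean distance, as in the question, and builds one candidate index per (1-cell, offset) pair, so for arrays with very many 1s the scipy routine is likely the better choice.

```python
import numpy as np

def buffer_ones(arr, radius=5):
    """Set every cell within `radius` (Euclidean distance) of a 1 to 1."""
    arr = np.asarray(arr).astype(bool)
    n_rows, n_cols = arr.shape
    rows, cols = np.nonzero(arr)  # positions of the existing 1s

    # All (di, dj) offsets that fall inside the disk
    di, dj = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    inside = di**2 + dj**2 <= radius**2
    di, dj = di[inside], dj[inside]

    # Candidate targets: one row per source cell, one column per offset
    ti = rows[:, None] + di[None, :]
    tj = cols[:, None] + dj[None, :]

    # Discard targets that fall outside the array
    ok = (ti >= 0) & (ti < n_rows) & (tj >= 0) & (tj < n_cols)

    out = np.zeros_like(arr)
    out[ti[ok], tj[ok]] = True
    return out.astype(int)

arr = np.zeros((21, 21), dtype=int)
arr[5, 5] = arr[15, 15] = 1
result = buffer_ones(arr, radius=5)
print(result.sum())  # 162: two non-overlapping disks of 81 cells each
```

The indices of the buffered cells, if they are needed separately, can be recovered with np.nonzero(result).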
