Python - dealing with numpy functions

I was given a question: I'm given 100000 sequences of 1000 coin tosses arranged in a matrix. To generate the data use the commands:
import numpy
data = numpy.random.binomial(1, 0.25, (100000,1000))
Now, for the first 5 sequences of 1000 tosses (the first 5 rows in data), I need to plot (using pylab) the estimate Xm, which is the sum over i from 1 to m (meaning the sum of all tosses up to toss m).
Now I was trying to do the following:
data = numpy.random.binomial(1, 0.25, (100000,1000))
x = numpy.linspace(1,1000,1000, int)  # in order to create an array of 1000 ints between 1 and 1000
y = numpy.sum(data[0], x)  # taking the first row
pylab.plot(x,y)
pylab.show()
And I'm getting an error:
only length-1 arrays can be converted to Python scalars
Furthermore, when I try
y = numpy.sum(data[0], tuple(x))
(because I looked up the function and saw that axis needed to be a tuple of ints), I get another error:
ValueError: too many values for 'axis'
So basically I'm a bit lost, could use some help.
Thanks!

I think you want numpy.cumsum. Furthermore, axis selects a dimension of the array to sum over, so it must be an integer (or a small tuple of dimension indices), not a 1000-element array or a tuple built from one; I think that explains both errors.
This should work:
import numpy as np
import pylab
data = np.random.binomial(1, 0.25, (100000,1000))
y = np.cumsum(data[:5, :], axis=1) # cumulative sum within each of the first 5 rows (axis=1)
pylab.plot(np.arange(1,y.shape[1]+1), y.T)
pylab.show()
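If the exercise actually wants the running estimate of p rather than the raw cumulative sum (my guess, since the question calls Xm an "estimate"), divide by the toss index; continuing from the snippet above:
m = np.arange(1, y.shape[1] + 1)
estimates = y / m  # y above is the cumulative sum; broadcasting divides every row by m
pylab.plot(m, estimates.T)
pylab.show()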

Related

What is an efficient way to calculate the mean of values in the bin with maximum frequency for a large number of numpy arrays?

I am looking for an efficient way to do the following calculation on millions of arrays. For the values in each array, I want to calculate the mean of the values in the bin with the highest frequency, as demonstrated below. Some of the arrays might contain nan values, and the remaining values are floats. The loop over my actual data takes too long to finish.
import numpy as np
array = np.array([np.random.uniform(0, 10) for i in range(800)])
# adding nan values
mask = np.random.choice([1, 0], array.shape, p=[.7, .3]).astype(bool)
array[mask] = np.nan
array = array.reshape(50, 16)
bin_values = np.linspace(0, 10, 21)
f = np.apply_along_axis(lambda a: np.histogram(a, bins=bin_values)[0], 1, array)
bin_start = np.apply_along_axis(lambda a: bin_values[np.argmax(a)], 1, f).reshape(array.shape[0], -1)
bin_end = bin_start + abs(bin_values[1] - bin_values[0])
values = np.zeros(array.shape[0])
for i in range(array.shape[0]):
    values[i] = np.nanmean(array[i][(array[i] >= bin_start[i]) * (array[i] < bin_end[i])])
Also, when I run the above code I get three warnings. The first is 'RuntimeWarning: Mean of empty slice' for the line where I calculate the values variable. I set a condition to skip that line in case a row is all nan values, but the warning did not go away, and I was wondering what the reason is. The other two warnings come from the less and greater_equal comparisons, which makes sense to me since they can encounter nan values.
The arrays that I want to run this algorithm on are independent, but I am already processing them with 12 separate scripts. Running the code in parallel would be an option, however, for now I am looking to improve the algorithm itself.
The reason I am using a lambda function is to run numpy.histogram over an axis, since the histogram function does not take an axis option. I was able to use a mask and remove the loop from the code; it is now twice as fast, but I think it can still be improved.
I can explain what I want to do in more detail with an example. Imagine I have 36 numbers that are greater than 0 and smaller than 20, and bins of equal width 0.5 over the same interval (0.0_0.5, 0.5_1.0, 1.0_1.5, … , 19.5_20.0). If I put the 36 numbers into their corresponding bins, I want the mean of the numbers in the bin that contains the most of them.
Please post your solution if you can think of a faster algorithm.
import numpy as np
# creating an array to test the algorithm
array = np.array([np.random.uniform(0, 10) for i in range(800)])
# adding nan values
mask = np.random.choice([1, 0], array.shape, p=[.7, .3]).astype(bool)
array[mask] = np.nan
array = array.reshape(50, 16)
# the algorithm
bin_values = np.linspace(0, 10, 21)
# calculating the frequency of each bin
f = np.apply_along_axis(lambda a: np.histogram(a, bins=bin_values)[0], 1, array)
bin_start = np.apply_along_axis(lambda a: bin_values[np.argmax(a)], 1, f).reshape(array.shape[0], -1)
bin_end = bin_start + (abs(bin_values[1]-bin_values[0]))
# creating a mask to get the mean over the bin with maximum frequency
mask = (array>=bin_start) * (array<bin_end)
mask_nan = np.tile(np.nan, (mask.shape[0], mask.shape[1]))
mask_nan[mask] = 1
v = np.nanmean(array * mask_nan, axis=1)
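Since the question invites faster solutions: below is a sketch of a fully vectorized variant (my own suggestion, not from the original post) that also drops apply_along_axis, building all row histograms at once with np.digitize and np.add.at. It assumes the same array and bin_values as above; unlike np.histogram it treats every bin as half-open, so results can differ at the very last bin edge.
valid = ~np.isnan(array)
idx = np.digitize(array, bin_values) - 1  # bin index of every element
idx = np.clip(idx, 0, len(bin_values) - 2)  # fold edge cases into the outer bins
counts = np.zeros((array.shape[0], len(bin_values) - 1), dtype=int)
np.add.at(counts, (np.nonzero(valid)[0], idx[valid]), 1)  # per-row histograms in one pass
best = np.argmax(counts, axis=1)  # most populated bin per row
in_best = valid & (idx == best[:, None])
v_fast = np.nanmean(np.where(in_best, array, np.nan), axis=1)  # should match v up to edge handling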

Multiplying large matrices in Python 3

I have a homework assignment that involves Python code and multiplying some Numpy arrays of length 100, but the following error comes up:
ValueError: maximum supported dimension for an ndarray is 32, found 100
Here's my code, xs and ys are two lists of length 100.
%pylab inline
import numpy as np
Y = np.array(ys).reshape(len(ys),1)
X = np.array([len(xs)*[1],xs]).transpose()
B = linalg.inv(X.transpose().dot(X)).dot(X.transpose(X)).dot(Y)
I guess you just have a typo in X.transpose(X):
linalg.inv(X.transpose().dot(X)).dot( X.transpose(X) ).dot(Y)
I don't think you really want to specify the axes along which to transpose as X; that is what raises the ValueError, since the 100 values of X get interpreted as axis numbers. I'm not sure what you're trying to compute, but probably X.transpose() is what you mean, or maybe you wanted X.transpose().dot(X).
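Putting the fix together (a sketch, assuming xs and ys are the two length-100 lists from the question):
import numpy as np
from numpy import linalg
Y = np.array(ys).reshape(len(ys), 1)
X = np.array([len(xs) * [1], xs]).transpose()
B = linalg.inv(X.transpose().dot(X)).dot(X.transpose()).dot(Y)  # transpose(), not transpose(X)
For what it's worth, np.linalg.lstsq(X, Y, rcond=None)[0] computes the same least-squares coefficients without forming the inverse explicitly.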

Replicating a matrix in pandas or numpy to a certain size

I have a matrix A of shape (41, 41), which is a DataFrame.
B is a matrix of size (7154, 8240), an ndarray.
I want to replicate A (keeping the whole 41x41 matrix intact) up to the size of B. It will not fit exactly, but then it should just clip the rows and columns that do not fit.
This is to be able to compute the elementwise product A*B.
I tried this code, but it fails because in Python 3 B.shape[0]/A.shape[0] is a float, and a list cannot be repeated a float number of times:
repeat = pd.concat([A]*(B.shape[0]/A.shape[0]), axis=0, ignore_index=True)
filter_large = pd.concat([repeat]*(B.shape[1]/A.shape[1]), axis=1, ignore_index=True)
filter_l = filter_large.values # change from a dataframe to a numpy array
AB = A*filter_l
I should mention that I've tried numpy.resize, but it does not keep the matrix intact; it mixes up all the rows, which is not what I want.
This code will do what you ask for:
shapeMultiples = (np.ceil(B.shape[0]/A.shape[0]).astype(int), np.ceil(B.shape[1]/A.shape[1]).astype(int))
res = np.tile(A, shapeMultiples)[:B.shape[0], :B.shape[1]]
Explanation:
np.tile(A, reps) repeats the matrix A multiple times along each axis. How often it is repeated is specified for each axis in reps.
For your example it should be repeated B.shape[0]/A.shape[0] times along axis 0 and B.shape[1]/A.shape[1] times along axis 1. However, you have to round these values up to make sure the tiled array covers the full size of B, which is what np.ceil does. Since reps is expected to be a shape of integers but ceil returns floats, we have to cast the type to int.
In the final step we cut off the result to make it fit the size of B with [:B.shape[0], :B.shape[1]].
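A self-contained toy run of the same idea, with small stand-in shapes (the names and sizes here are illustrative, not from the question):
import numpy as np
A = np.arange(9).reshape(3, 3)  # stand-in for the 41x41 matrix
B = np.ones((7, 8))  # stand-in for the (7154, 8240) ndarray
shapeMultiples = (np.ceil(B.shape[0] / A.shape[0]).astype(int),
                  np.ceil(B.shape[1] / A.shape[1]).astype(int))
res = np.tile(A, shapeMultiples)[:B.shape[0], :B.shape[1]]
print(res.shape)  # (7, 8): same shape as B, so res * B works elementwise
If A is a pandas DataFrame, np.tile(A, ...) operates on its values and returns a plain ndarray, which is what you want for the elementwise product.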

How to add a random number to a subsection of a numpy array?

I'm new to Python.
I have a numpy array of 3 columns and 50 rows. I want to add a value drawn from a normal distribution to every number in the array except the first row. I am curious to know if there is a cleaner but still readable way to do this compared to what I am currently doing. At the moment I'm using this perhaps not-so-elegant approach:
nRows = np.shape(data)[0]
nCols = np.shape(data)[1]
x = data[0, :].copy()  # Copy the first row
# Add a random number to all rows but 0
for i in range(nCols):
    data[:, i] += np.random.normal(0, 0.8, nRows)
data[0, :] = x  # Copy the first row back
You can assign values to an indexed array. For your case, generate the 2d random array first and then add it directly to the sliced data:
data[1:] += np.random.normal(0, 0.8, (nRows - 1, nCols))
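A quick self-contained check of that one-liner (the 50x3 shape mirrors the question; the all-zero array is just for illustration):
import numpy as np
data = np.zeros((50, 3))
nRows, nCols = data.shape
data[1:] += np.random.normal(0, 0.8, (nRows - 1, nCols))
print(data[0])  # the first row is untouched: [0. 0. 0.]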

Python scatter - change marker style based on entity

I've been struggling for days trying to resolve this problem: I have cartesian coordinates on the y-axis (depth from 0 to 1) and numbers with different values on the x-axis (the numbers are the firing rates of different cell populations at the given depth on the y-axis, so they vary randomly).
I would like the markers in the scatterplot to be bigger for bigger x-axis values (firing rate).
Thank you for any suggestion.
This is the code (not working).
import matplotlib.pyplot as plt
import numpy as np
import matplotlib.cbook as cbook
x = np.genfromtxt('x_dex.csv', delimiter=',')
y = np.genfromtxt('z_dex.csv', delimiter=',')
array = [i for i in x if i > 4]
array.sort()
s = [30*2**n for n in range(len(array))];
plt.subplot(212)
plt.scatter(x,y,s=s)
plt.show()
This is unfortunately not showing the correct relation between size of marker and depth.
The line where you compute your 'size' values looks incorrect to me:
s = [30*2**n for n in range(len(array))];
This will give you a list containing:
s = [30*2**0, 30*2**1, 30*2**2, ..., 30*2**(len(array) - 1)]
The values bear no relation to y, so I assume this is not what you intended. Maybe you meant something more like this:
s = 30 * 2 ** y
There are actually several other issues here:
Don't give your variables names like array - this can lead to confusion with numpy.array. It's even worse in this case, since array is actually not an array but a Python list!
Since you're dealing with numpy arrays, it's much faster to use vectorization rather than list comprehensions. For example, you could use:
array = x[x > 4]
rather than
array = [i for i in x if i > 4]
After your list comprehension array = [i for i in x if i > 4], array will have a different number of elements than y if any elements of x are not greater than 4.
array.sort() will sort the list in place, which means that the order of the elements in array will no longer match the order of elements in y.
In fact, sorting seems rather pointless in this situation - since you're making a scatter plot the order of the points should not matter.
You're not writing MATLAB code any more, so there's no need to end lines on a semicolon (although it won't do any harm if you do).
Here's my educated guess at what you're trying to do:
import matplotlib.pyplot as plt
import numpy as np
x = np.genfromtxt('x_dex.csv', delimiter=',')
y = np.genfromtxt('z_dex.csv', delimiter=',')
# get the set of indices that will sort x in ascending order, apply these
# to both x & y
order = np.argsort(x)
x_sorted = x[order]
y_sorted = y[order]
# keep only xy pairs where x > 4
valid = x_sorted > 4
x_valid = x_sorted[valid]
y_valid = y_sorted[valid]
# compute the sizes
s = 30 * 2 ** y_valid
# plot
plt.subplot(212)
plt.scatter(x_valid, y_valid, s=s)
plt.show()
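One caveat (my reading of the question, not part of the original answer): the asker wants marker size to follow the x value (firing rate), so it may be x_valid rather than y_valid that belongs in the size formula, and a linear rescaling is often easier to read than exponential growth:
# scale sizes linearly from 30 to 200 with firing rate (assumes x_valid is not constant)
s = 30 + 170 * (x_valid - x_valid.min()) / np.ptp(x_valid)
plt.scatter(x_valid, y_valid, s=s)
plt.show()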
