Fitting Distributions in Scipy Based on Frequency Data Efficiently - python

I have some data that I want to fit to a distribution. The data is given by the frequency. What I mean is, I have every event that I have observed and the number of times that I have observed it. So something like:
data = [(1, 34), (2, 1023), (3, 3243), (4, 879), (5, 202), (6, 10)]
where the first number in each tuple is the event I have observed, and the second number is the total observations for that event.
With Scipy, I can fit (for example) a lognormal distribution using a call to scipy.stats.lognorm.fit. However, this routine expects to see a list of all of the observations, not the frequencies. I can fit the distribution like this:
import scipy.stats

temp_data = []
for x in data:
    temp_data += [x[0]] * x[1]
params = scipy.stats.lognorm.fit(temp_data)
but wow, that seems horribly inefficient.
Is there a way to fit a distribution, in Scipy or a similar tool, based upon the frequencies? If not, is there a better way to fit the distribution without having to create a potentially giant list of values?

Unfortunately, looking at the source, it seems like the 'materialized' aspect of the data is hardcoded. The function's not that complicated, though, so you could make your own version. TBH if your total N is still manageable I'd probably just do data = np.array(data); expanded_data = np.repeat(data[:,0], data[:,1]) despite the inefficiency, because life is short.
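For completeness, here is a minimal sketch of what "your own version" could look like: minimizing the frequency-weighted negative log-likelihood directly with scipy.optimize.minimize, so the data never has to be expanded. The parameterization (shape, scale with loc fixed at 0) mirrors scipy's lognorm convention, but the starting point and optimizer choice are my own assumptions:
import numpy as np
import scipy.optimize
import scipy.stats

data = np.array([(1, 34), (2, 1023), (3, 3243), (4, 879), (5, 202), (6, 10)], dtype=float)
values, counts = data[:, 0], data[:, 1]

def weighted_nll(params):
    # Frequency-weighted negative log-likelihood of a lognormal with
    # loc fixed at 0; params are (shape, scale) in scipy's convention.
    shape, scale = params
    if shape <= 0 or scale <= 0:
        return np.inf
    return -np.sum(counts * scipy.stats.lognorm.logpdf(values, shape, loc=0, scale=scale))

# Nelder-Mead and the unit starting point are arbitrary but reasonable here.
res = scipy.optimize.minimize(weighted_nll, x0=[1.0, 1.0], method="Nelder-Mead")
shape_hat, scale_hat = res.x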
Another alternative would be to use pomegranate, which supports passing weights:
import numpy as np
import scipy.stats
import matplotlib.pyplot as plt
import pomegranate as pg
data = [(1, 34), (2, 1023), (3, 3243), (4, 879), (5, 202), (6, 10)]
data = np.array(data)
expanded = np.repeat(data[:,0], data[:,1].astype(int))
scipy_shape, _, scipy_scale = scipy_params = scipy.stats.lognorm.fit(expanded, floc=0)
scipy_sigma, scipy_mu = scipy_shape, np.log(scipy_scale)
pg_dist = pg.LogNormalDistribution(0, 1)
pg_dist.fit(data[:,0], weights=data[:,1])
pg_mu, pg_sigma = pg_dist.parameters
fig = plt.figure()
ax = fig.add_subplot(111)
x = np.linspace(0.1, 10, 100)
ax.plot(data[:,0], data[:, 1] / data[:,1].sum(), label="freq")
ax.plot(x, scipy.stats.lognorm(*scipy_params).pdf(x),
        label=r"scipy: $\mu$ {:1.3f} $\sigma$ {:1.3f}".format(scipy_mu, scipy_sigma),
        alpha=0.5)
ax.plot(x, pg_dist.probability(x),
        label=r"pomegranate: $\mu$ {:1.3f} $\sigma$ {:1.3f}".format(pg_mu, pg_sigma),
        linestyle='--', alpha=0.5)
ax.legend(loc='upper right')
fig.savefig("compare.png")
gives me a comparison plot (compare.png) overlaying the frequency data with the scipy and pomegranate lognormal fits.

You can draw a random sample according to your frequency distribution, and fit that:
import numpy as np
import scipy.stats

data = np.array(
    [(1, 34), (2, 1023), (3, 3243), (4, 879), (5, 202), (6, 10)],
    dtype=float,
)
values = data[:, 0]   # observed events (first column)
weights = data[:, 1]  # frequencies (second column)
seed = 87
gen = np.random.default_rng(seed)
# Generator.choice draws a weighted sample; p must sum to 1.
sample = gen.choice(values, size=500, p=weights / np.sum(weights))
params = scipy.stats.lognorm.fit(sample)

Related

Calculating angles of body skeleton in video using OpenPose

Disclaimer: This question is about OpenPose, but the key here is actually figuring out how to use the output (the coordinates stored in the JSON), not how to use OpenPose itself, so please read it to the end.
I have a video of a person on a bike, filmed from the side (a profile of him sitting, so we see his right side). I use OpenPose to extract the coordinates of the skeleton. OpenPose provides the coordinates in a JSON file that looks like this (see the docs for an explanation):
{
"version": 1.3,
"people": [
{
"person_id": [
-1
],
"pose_keypoints_2d": [
594.071,
214.017,
0.917187,
523.639,
216.025,
0.797579,
519.661,
212.063,
0.856948,
539.251,
294.394,
0.873084,
619.546,
304.215,
0.897219,
531.424,
221.854,
0.694434,
550.986,
310.036,
0.787151,
625.477,
339.436,
0.845077,
423.656,
319.878,
0.660646,
404.111,
321.807,
0.650697,
484.434,
437.41,
0.85125,
404.13,
556.854,
0.791542,
443.261,
319.801,
0.601241,
541.241,
370.793,
0.921286,
502.02,
494.141,
0.799306,
592.138,
198.429,
0.943879,
0,
0,
0,
562.742,
182.698,
0.914112,
0,
0,
0,
537.25,
504.024,
0.530087,
535.323,
500.073,
0.526998,
486.351,
500.042,
0.615485,
449.168,
594.093,
0.700363,
431.482,
594.156,
0.693443,
386.46,
560.803,
0.803862
],
"face_keypoints_2d": [],
"hand_left_keypoints_2d": [],
"hand_right_keypoints_2d": [],
"pose_keypoints_3d": [],
"face_keypoints_3d": [],
"hand_left_keypoints_3d": [],
"hand_right_keypoints_3d": []
}
]
}
From what I understand, each JSON is a frame of the video.
My goal is to detect the angles of specific coordinates like right knee, right arm, etc. For example:
openpose_angles = [(9, 10, 11, "right_knee"),
                   (2, 3, 4, "right_arm")]
This is based on the OpenPose skeleton dummy (keypoint-numbering figure not reproduced here).
What I did is to calculate the angle between three coordinates (using Python):
temp_df = json.load(open(os.path.join(jsons_dir, file)))
listPoints = list(zip(*[iter(temp_df['people'][person_number]['pose_keypoints_2d'])] * 3))
count = 0
lmList2 = {}
for x,y,c in listPoints:
lmList2[count]=(x,y,c)
count+=1
p1=angle_cords[0]
p2=angle_cords[1]
p3=angle_cords[2]
x1, y1 ,c1= lmList2[p1]
x2, y2, c2 = lmList2[p2]
x3, y3, c3 = lmList2[p3]
# Calculate the angle
angle = math.degrees(math.atan2(y3 - y2, x3 - x2) -
math.atan2(y1 - y2, x1 - x2))
if angle < 0:
angle += 360
I saw this method on some blog (I forget which), though it was about OpenCV rather than OpenPose (not sure if that makes a difference), but I get angles that do not make sense. We showed it to our teacher and he suggested we use vectors to calculate the angles instead of math.atan2. But we got confused about how to implement this.
To summarize, here is the question - What will be the best way to calculate the angles? How to calculate them using vectors?
Your teacher is right. I suspect the problem is that 3 points can make up 3 different angles depending on the order. Just consider the angles in a triangle. Also you seem to ignore the 3rd coordinate.
Reconstruct the Skeleton
In your picture you indicate that the edges/bones of the skeleton are
edges = {(0, 1), (0, 15), (0, 16), (1, 2), (1, 5), (1, 8), (2, 3), (3, 4), (5, 6), (6, 7), (8, 9), (8, 12), (9, 10), (10, 11), (11, 22), (11, 24), (12, 13), (13, 14), (14, 19), (14, 21), (15, 17), (16, 18), (19, 20), (22, 23)}
I get the points from your json file with
points = np.array(pose['people'][0]['pose_keypoints_2d']).reshape(-1, 3)  # pose = the loaded JSON dict
Now I plot that ignoring the 3rd component to get an idea what I am working with. Notice that this does not change the proportions much since the 3rd component is really small compared to the others.
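A minimal sketch of that plot, assuming points is the reshaped array above and edges is the set given earlier (the matplotlib specifics are my own):
import matplotlib.pyplot as plt

xy = points[:, :2]  # drop the confidence column for plotting
fig, ax = plt.subplots()
for a, b in edges:
    ax.plot([xy[a, 0], xy[b, 0]], [xy[a, 1], xy[b, 1]], 'b-')
ax.scatter(xy[:, 0], xy[:, 1], c='r', s=15)
# Image coordinates grow downward, which is why the figure
# appears upside down unless you invert the y-axis.
plt.show()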
One definitely recognizes an upside-down man. I notice there seems to be some kind of artifact, but I suspect this is just an error in recognition and would be better in another frame.
Calculate the Angle
Recall that the dot product divided by the product of the norms gives the cosine of the angle; see the Wikipedia article on the dot product for the relevant picture. So now I can get the angle of two joined edges like this.
import math
import numpy as np

def get_angle(edge1, edge2):
    assert tuple(sorted(edge1)) in edges
    assert tuple(sorted(edge2)) in edges
    edge1 = set(edge1)
    edge2 = set(edge2)
    mid_point = edge1.intersection(edge2).pop()
    a = (edge1 - edge2).pop()
    b = (edge2 - edge1).pop()
    v1 = points[mid_point] - points[a]
    v2 = points[mid_point] - points[b]
    angle = math.degrees(np.arccos(np.dot(v1, v2)
                                   / (np.linalg.norm(v1) * np.linalg.norm(v2))))
    return angle
For example if you wanted the elbow angles you could do
get_angle((3, 4), (2, 3))
get_angle((5, 6), (6, 7))
giving you
110.35748420197164
124.04586139643376
Which to me makes sense when looking at my picture of the skeleton. It's a bit more than a right angle.
What if I had to calculate the angle between two vectors that do not share one point?
In that case you have to be more careful, because in that case the vectors' orientation matters. First, here is the code:
def get_oriented_angle(edge1, edge2):
    assert tuple(sorted(edge1)) in edges
    assert tuple(sorted(edge2)) in edges
    v1 = points[edge1[0]] - points[edge1[1]]
    v2 = points[edge2[0]] - points[edge2[1]]
    angle = math.degrees(np.arccos(np.dot(v1, v2)
                                   / (np.linalg.norm(v1) * np.linalg.norm(v2))))
    return angle
As you can see, the code is much simpler because it doesn't order the points for you. But it is dangerous, since there are two angles between two vectors (if you don't consider their orientation). Make sure both vectors point in the direction of the point you're measuring the angle at (both pointing in the opposite direction works too).
Here is the same example as above
get_oriented_angle((3, 4), (2, 3)) -> 69.64251579802836
As you can see this does not agree with get_angle((3, 4), (2, 3)): the two results are supplementary (69.64° + 110.36° = 180°), because flipping one vector's orientation turns θ into 180° − θ. If you want the same result you have to put the 3 first (or last) in both cases.
If you do
get_oriented_angle((3, 4), (3, 2)) -> 110.35748420197164
It is the same angle as above.

Why is my array coming out as shape: (6, 1, 2) when it is made of two (6, ) arrays?

I'm trying to import data from an Excel file and create an array pos with 6 rows and two columns. Later, when I go to index the array with pos[0][1], I get an error: IndexError: index 1 is out of bounds for axis 0 with size 1.
I looked at the shape of my array and it returns (6, 1, 2), while I was expecting (6, 2). The individual shapes of the arrays which make up pos are (6,) and (6,), which I also don't really understand: why not (6, 1)? I don't quite understand the difference between the two.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plot
import cartopy.crs as crs
import cartopy.feature as cf

irmadata = pd.read_excel("DangerZone.xlsx")
irma_lats = irmadata["Average Latitude"].tolist()
irma_longs = irmadata["Average Longitude"].tolist()
shipdata = pd.read_excel("ShipPositions.xlsx")
ship_lats = shipdata["Latitude"].to_numpy()  ## these are the (6,) arrays
ship_longs = shipdata["Longitude"].to_numpy()
pos = np.array([[ship_lats], [ship_longs]], dtype="d").T
extent = [-10, -90, 0, 50]
ax = plot.axes(projection=crs.PlateCarree())
ax.stock_img()
ax.add_feature(cf.COASTLINE)
ax.coastlines(resolution="50m")
ax.set_title("Base Map")
ax.set_extent(extent)
ax.plot(irma_longs, irma_lats)
for i in range(len(ship_lats)):
    lat = pos[i][0]
    lon = pos[i][1]  ## This is where my error occurs
    ax.plot(lon, lat, 'o', label="Ship " + str(i + 1))
plot.show()
Obviously, I could just index pos[0][0][1] however, I'd like to know why I'm getting this issue. I'm coming from MATLAB so I suppose a lot of my issues will stem from differences in how numpy and MATLAB work, and hence any tips would also be appreciated!
I solved it: I didn't realise I could just use single square brackets when combining my two column arrays. Changing pos = np.array([[ship_lats], [ship_longs]], dtype = "d").T to pos = np.array([ship_lats, ship_longs], dtype = "d").T worked. The extra brackets turn each (6,) array into a (1, 6) row, so the stacked array is (2, 1, 6) and its transpose is (6, 1, 2); without them the stack is (2, 6) and the transpose is (6, 2).
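To see why the extra brackets matter, here is a quick sketch of the shapes involved (with stand-in (6,) arrays):
import numpy as np

ship_lats = np.arange(6, dtype="d")   # shape (6,)
ship_longs = np.arange(6, dtype="d")  # shape (6,)

# Wrapping each (6,) array in its own brackets makes two (1, 6) rows,
# so the stack is (2, 1, 6) and its transpose is (6, 1, 2).
print(np.array([[ship_lats], [ship_longs]], dtype="d").T.shape)  # (6, 1, 2)

# Stacking the bare arrays gives (2, 6), whose transpose is (6, 2).
print(np.array([ship_lats, ship_longs], dtype="d").T.shape)      # (6, 2)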

Why is irfftn(rfftn(x)) not equal to x?

If the trailing dimension of an array x is odd, the transform y = irfftn(rfftn(x)) does not have the same shape as the input array. Is this by design? And if so, what is the motivation? Example code is below.
import numpy as np

shapes = [(10, 10), (11, 11), (10, 11), (11, 10)]
for shape in shapes:
    x = np.random.uniform(0, 1, shape)
    y = np.fft.irfftn(np.fft.rfftn(x))
    if x.shape != y.shape:
        print("expected shape %s but got %s" % (shape, y.shape))

# Output
# expected shape (11, 11) but got (11, 10)
# expected shape (10, 11) but got (10, 10)
You need to pass the second parameter, x.shape.
In your case the code will look like:
import numpy as np

shapes = [(10, 10), (11, 11), (10, 11), (11, 10)]
for shape in shapes:
    x = np.random.uniform(0, 1, shape)
    y = np.fft.irfftn(np.fft.rfftn(x), x.shape)
    if x.shape != y.shape:
        print("expected shape %s but got %s" % (shape, y.shape))
from the docs
This function computes the inverse of the N-dimensional discrete
Fourier Transform for real input over any number of axes in an
M-dimensional array by means of the Fast Fourier Transform (FFT). In
other words, irfftn(rfftn(a), a.shape) == a to within numerical
accuracy. (The a.shape is necessary like len(a) is for irfft, and for
the same reason.)
x.shape descriptions from the same docs:
s : sequence of ints, optional Shape (length of each transformed axis)
of the output (s[0] refers to axis 0, s[1] to axis 1, etc.). s is also
the number of input points used along this axis, except for the last
axis, where s[-1]//2+1 points of the input are used. Along any axis,
if the shape indicated by s is smaller than that of the input, the
input is cropped. If it is larger, the input is padded with zeros. If
s is not given, the shape of the input along the axes specified by
axes is used.
https://docs.scipy.org/doc/numpy/reference/generated/numpy.fft.irfftn.html
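The ambiguity is easiest to see on the last axis alone; a quick sketch:
import numpy as np

# rfft keeps n//2 + 1 points along the transformed axis, so even and
# odd lengths collide: both 10 and 11 map to 6 frequency bins.
print(np.fft.rfft(np.ones(10)).shape)  # (6,)
print(np.fft.rfft(np.ones(11)).shape)  # (6,)

# Without an explicit output length, irfft assumes the even case 2*(m-1).
print(np.fft.irfft(np.fft.rfft(np.ones(11))).shape)  # (10,)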

How to create a numpy array from two lists of tuples, but only when the tuples are the same

For image analysis I loaded a float image with scipy's imread.
Next, I had scipy's argrelmax search for local maxima along axis 0 and axis 1 and stored the results as tuples of arrays.
import scipy.misc as msc
from scipy.signal import argrelmax as almax

data = msc.imread(prediction1, 'F')  # prediction1 is the image path
datarelmax_0 = almax(data, axis=0)
datarelmax_1 = almax(data, axis=1)
How can I create a numpy array from both results that contains only the tuples present in both?
Edit:
argrelmax creates a tuple with two arrays:
datarelmax_0 = ([1,2,3,4,5],[6,7,8,9,10])
datarelmax_1 = ([11,2,13,14,5], [11,7,13,14,10])
I want to create a numpy array that looks like:
result_ar = [(2, 7), (5, 10)]
How about this "naive" way?
import numpy as np
result = np.array([x for x in datarelmax_0 if x in datarelmax_1])
Pretty simple. Maybe there's a better/faster/fancier way by using some numpy methods but this should work for now.
EDIT:
To answer your edited question, you can do this:
result = [x for x in zip(datarelmax_0[0], datarelmax_0[1]) if x in zip(datarelmax_1[0], datarelmax_1[1])]
This gives you
result = [(2, 7), (5, 10)]
If you convert it to a numpy array by using
result = np.array(result)
it looks like this:
result = array([[ 2, 7],
[ 5, 10]])
In case you are interested in what zip does:
>>> list(zip(datarelmax_0[0], datarelmax_0[1]))
[(1, 6), (2, 7), (3, 8), (4, 9), (5, 10)]
>>> list(zip(datarelmax_1[0], datarelmax_1[1]))
[(11, 11), (2, 7), (13, 13), (14, 14), (5, 10)]
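For larger arrays, here is a sketch of a set-based variant that avoids the repeated linear scans of x in zip(...) (my own suggestion, not part of the answer above):
import numpy as np

datarelmax_0 = ([1, 2, 3, 4, 5], [6, 7, 8, 9, 10])
datarelmax_1 = ([11, 2, 13, 14, 5], [11, 7, 13, 14, 10])

# Intersect sets of (row, col) pairs; set membership is O(1) per lookup.
common = set(zip(*datarelmax_0)) & set(zip(*datarelmax_1))
result = np.array(sorted(common))
# result -> array([[ 2,  7],
#                  [ 5, 10]])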

Find all nearest neighbors within a specific distance

I have a large list of x and y coordinates, stored in a numpy array.
Coordinates = [[ 60037633 289492298]
[ 60782468 289401668]
[ 60057234 289419794]]
...
...
What I want is to find all nearest neighbors within a specific distance (let's say 3 meters) and store the result so that I can later do some further analysis on it.
For most packages I found, it is necessary to decide how many NNs should be found, but I just want all within the set distance.
How can I achieve something like that, and what is the fastest and best way to do it for a large dataset (several million points)?
You could use a scipy.spatial.cKDTree:
import numpy as np
import scipy.spatial as spatial
points = np.array([(1, 2), (3, 4), (4, 5)])
point_tree = spatial.cKDTree(points)
# This finds the index of all points within distance 1 of [1.5,2.5].
print(point_tree.query_ball_point([1.5, 2.5], 1))
# [0]
# This gives the point in the KDTree which is within 1 unit of [1.5, 2.5]
print(point_tree.data[point_tree.query_ball_point([1.5, 2.5], 1)])
# [[1 2]]
# More than one point is within 3 units of [1.5, 1.6].
print(point_tree.data[point_tree.query_ball_point([1.5, 1.6], 3)])
# [[1 2]
# [3 4]]
Here is an example showing how you can find all the nearest neighbors to an array of points with one call to point_tree.query_ball_point:
import numpy as np
import scipy.spatial as spatial
import matplotlib.pyplot as plt

np.random.seed(2015)
centers = [(1, 2), (3, 4), (4, 5)]
points = np.concatenate([pt + np.random.random((10, 2)) * 0.5
                         for pt in centers])
point_tree = spatial.cKDTree(points)
cmap = plt.get_cmap('copper')
colors = cmap(np.linspace(0, 1, len(centers)))
for center, group, color in zip(centers, point_tree.query_ball_point(centers, 0.5), colors):
    cluster = point_tree.data[group]
    x, y = cluster[:, 0], cluster[:, 1]
    plt.scatter(x, y, c=color, s=200)
plt.show()
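If you need every pair of mutual neighbors within 3 meters across the whole dataset (rather than the neighbors of chosen query points), here is a sketch using cKDTree.query_pairs; the random coordinates are just a stand-in for the real data:
import numpy as np
import scipy.spatial as spatial

rng = np.random.default_rng(0)
coords = rng.uniform(0, 1000, size=(100_000, 2))  # stand-in for the real coordinates

tree = spatial.cKDTree(coords)
# Set of index pairs (i, j) with i < j whose points are within 3 units.
pairs = tree.query_pairs(r=3.0)
print(len(pairs))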
