Joint probability distribution using Python

P = np.array([
    [0.03607908, 0.03760034, 0.00503184, 0.0205082 , 0.01051408,
     0.03776221, 0.00131325, 0.03760817, 0.01770659],
    [0.03750162, 0.04317351, 0.03869997, 0.03069872, 0.02176718,
     0.04778769, 0.01021053, 0.00324185, 0.02475319],
    [0.03770951, 0.01053285, 0.01227089, 0.0339596 , 0.02296711,
     0.02187814, 0.01925662, 0.0196836 , 0.01996279],
    [0.02845139, 0.01209429, 0.02450163, 0.00874645, 0.03612603,
     0.02352593, 0.00300314, 0.00103487, 0.04071951],
    [0.00940187, 0.04633153, 0.01094094, 0.00172007, 0.00092633,
     0.02032679, 0.02536328, 0.03552956, 0.01107725]
])
I'm trying to figure out how to find the variance of X + Y as well as the expectation E[X + Y], where X is the random variable indexed by the rows of the array above and Y the one indexed by the columns.
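For what it's worth, here is a minimal sketch of one way to compute both, assuming X and Y take the row and column indices (0-4 and 0-8) as their values, since the question doesn't state the supports:
import numpy as np

# P is the 5x9 joint distribution from the question;
# assumed supports: X in 0..4 (rows), Y in 0..8 (columns)
x = np.arange(P.shape[0])
y = np.arange(P.shape[1])

px = P.sum(axis=1)                        # marginal distribution of X
py = P.sum(axis=0)                        # marginal distribution of Y

E_sum = (x * px).sum() + (y * py).sum()   # E[X+Y] = E[X] + E[Y]

s = x[:, None] + y[None, :]               # value of X+Y for every (row, column) cell
Var_sum = (s**2 * P).sum() - E_sum**2     # Var(X+Y) = E[(X+Y)^2] - (E[X+Y])^2

print(E_sum, Var_sum)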

Related

Rearranging License Plate characters based on country

I am doing a license/number plate recognition project and I'm near completion, but there is a small problem. I have successfully recognized the characters; consider the example below:
For this input image, I got the prediction 2791 2g rj14.
As you can see, the OCR did a great job, but the arrangement is destroyed (defeating the whole purpose). Sometimes it outputs the characters in the correct sequence and sometimes it does not, so for the cases where it does not, I'm trying to develop an algorithm that takes the predicted num_plate string as input and rearranges it according to the format used in my country (India).
Below are some images that describe the format of Indian number/license plates.
Also, I have collected the codes for all the states, but for now I just want to handle 3 states: Delhi (DL), Haryana (HR), Uttar Pradesh (UP). More info: https://en.wikipedia.org/wiki/List_of_Regional_Transport_Office_districts_in_India
total_states_list = [
    'AN','AP','AR','AS','BR','CG','CH','DD','DL','DN','GA','GJ','HR','HP','JH','JK','KA','KL',
    'LD','MH','ML','MN','MP','MZ','NL','OD','PB','PY','RJ','SK','TN','TR','TS','UK','UP','WB'
]
district_codes = {
    'DL': ['1','2','3','4','5','6','7','8','9','10','11','12','13'],
    # quoted so the leading zeros survive; bare literals like 01, 02 are a SyntaxError in Python 3
    'HR': ['01','02','03','04','05','06','07','08','09','10','11','12','13','14','15','16','17','18','19','20',
           '21','22','23','24','25','26','27','28','29','30','40','41','42','43','44','45','46','47','48','49',
           '50','51','52','53','54','55','56','57','58','59','60','61','62','63','64','65','66','67','68','69',
           '70','71','72','73','74','75','76','77','78','79','80','81','82','83','84','85','86','87','88','89',
           '90','91','92','93','94','95','96','97','98','99']
}
So far I have not been able to come up with an algorithm that rearranges the tokens into the required sequence when they come out of order. Any help would be really appreciated.
Details about OCR
Using keras-ocr, I'm getting the following output for the input image:
[
('hrlz', array([[ 68.343796, 42.088367],
[196.68803 , 26.907867],
[203.00832 , 80.343094],
[ 74.66408 , 95.5236 ]], dtype=float32)),
('c1044', array([[ 50.215836, 113.09602 ],
[217.72466 , 92.58473 ],
[224.3968 , 147.07387 ],
[ 56.887985, 167.58516 ]], dtype=float32))
]
source: https://keras-ocr.readthedocs.io/en/latest/examples/using_pretrained_models.html
Inside keras_ocr.tools.drawAnnotations I think they work with the prediction boxes, so I located the file and found the implementation of the drawAnnotations function. Here it is:
def drawAnnotations(image, predictions, ax=None):
    if ax is None:
        _, ax = plt.subplots()
    ax.imshow(drawBoxes(image=image, boxes=predictions, boxes_format='predictions'))
    predictions = sorted(predictions, key=lambda p: p[1][:, 1].min())
    left = []
    right = []
    for word, box in predictions:
        if box[:, 0].min() < image.shape[1] / 2:
            left.append((word, box))
        else:
            right.append((word, box))
    ax.set_yticks([])
    ax.set_xticks([])
    for side, group in zip(['left', 'right'], [left, right]):
        for index, (text, box) in enumerate(group):
            y = 1 - (index / len(group))
            xy = box[0] / np.array([image.shape[1], image.shape[0]])
            xy[1] = 1 - xy[1]
            ax.annotate(s=text,
                        xy=xy,
                        xytext=(-0.05 if side == 'left' else 1.05, y),
                        xycoords='axes fraction',
                        arrowprops={
                            'arrowstyle': '->',
                            'color': 'r'
                        },
                        color='r',
                        fontsize=14,
                        horizontalalignment='right' if side == 'left' else 'left')
    return ax
How should I go about getting the (x, y, w, h) of each box and then sorting/printing the words according to the y/x coordinates of the number-plate bounding boxes?
EDIT - 2
I managed to get the bounding box of the characters, as you can see in the image below,
using cv2.polylines with box set to the same coordinates as in the output I pasted earlier. Now how can I print them in sequence, left to right, using the y/x coordinates as suggested in the comments?
If you can get the coordinates of each identified text box, then:
1. Rotate the coordinates so the boxes are parallel with the X-axis
2. Scale the Y-coordinates so they can be rounded to integers, so that boxes that are side by side get the same integer Y-coordinate (like a line number)
3. Sort the data by Y, then X coordinate
4. Extract the texts in that order
Here is an example of such a sequence:
data = [
('hrlz', [[ 68.343796, 42.088367],
[196.68803 , 26.907867],
[203.00832 , 80.343094],
[ 74.66408 , 95.5236 ]]),
('c1044',[[ 50.215836, 113.09602 ],
[217.72466 , 92.58473 ],
[224.3968 , 147.07387 ],
[ 56.887985, 167.58516 ]])
]
# rotate data to align with X-axis
a, b = data[0][1][:2]
dist = ((b[1] - a[1]) ** 2 + (b[0] - a[0]) ** 2) ** 0.5
sin = (b[1] - a[1]) / dist
cos = (b[0] - a[0]) / dist
data = [(text, [(x * cos + y * sin, y * cos - x * sin) for x, y in box])
        for text, box in data]

# scale Y coordinate to integers
a, b = data[0][1][1:3]
height = b[1] - a[1]
data = [(round(box[0][1] / height), box[0][0], text)
        for text, box in data]

# sort by Y, then X
data.sort()

# get the text in the right order
print("".join(text for _, _, text in data))
This assumes that the points of the boxes are given in the following clockwise order:
top-left, top-right, bottom-right, bottom-left
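Running this on the sample data above prints hrlzc1044.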

Python line of best fit return value

I am trying to fit a straight line to my graph:
The x values of the red line (raw data) are:
array([ 0.03591733, 0.16728212, 0.49537727, 0.96912459,
1. , 1. , 1.11894521, 1.93042113,
2.94284656, 10.98699942])
and the y values are
array([ 0.0016241 , 0.00151784, 0.00155586, 0.00174498, 0.00194872,
0.00189413, 0.00208325, 0.00218074, 0.0021281 , 0.00243127])
my code for the line of best fit is:
LineFit = np.polyfit(x, y, 1)
p = np.poly1d(LineFit)
plt.plot(x, y, 'r-')
plt.plot(x, p(x), '--')   # evaluate the fitted polynomial at x, not y
plt.show()
However, LineFit returns
array([ 7.03475069e-05, 1.76565292e-03])
which is supposed to be intercept then gradient according to the definition of polyfit I was reading (lowest- to highest-order coefficients):
https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.polynomial.polynomial.polyfit.html
but judging from the plot it seems to be the opposite (gradient, then intercept).
Could someone explain this to me?
You are looking at a different doc. See https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.polyfit.html#numpy.polyfit:
...Fit a polynomial p(x) = p[0] * x**deg + ... + p[deg] of degree deg to points (x, y).
So in your example it is p(x) = p[0] * x + p[1], i.e. exactly gradient, then intercept.
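A quick way to see the two conventions side by side (a small demonstration, not from the question; the data is a made-up exact line):
import numpy as np
from numpy.polynomial import polynomial as Poly

x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0                 # known gradient 2, intercept 1

print(np.polyfit(x, y, 1))        # [2. 1.]  highest-order coefficient first
print(Poly.polyfit(x, y, 1))      # [1. 2.]  lowest-order coefficient first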

Tensorflow Grab Predictions and Indices for values above thresholds

What is the easiest way to grab the prediction values above a certain threshold, together with their indices?
Consider this problem:
sess = tf.InteractiveSession()
predictions = tf.constant([[ 0.32957435, 0.82079124, 0.54503286, 0.51966476, 0.63359714,
0.92034972, 0.13774526, 0.45154464, 0.18284607, 0.14604568],
[ 0.78612137, 0.98291659, 0.4841609 , 0.63260579, 0.21568334,
0.82978213, 0.05054879, 0.09517837, 0.28309393, 0.01788473],
[ 0.05706763, 0.24366784, 0.04608512, 0.32987678, 0.2342416 ,
0.91725373, 0.60084391, 0.51787591, 0.74161232, 0.30830121],
[ 0.67310858, 0.6250236 , 0.42477703, 0.37107778, 0.65123832,
0.97282803, 0.59533679, 0.49564457, 0.54935825, 0.63008392],
[ 0.70233917, 0.48129809, 0.59114349, 0.63535333, 0.71188867,
0.4799161 , 0.90896237, 0.86089945, 0.47896886, 0.83451629],
[ 0.82923532, 0.8950938 , 0.99231505, 0.05526769, 0.98151541,
0.18153167, 0.63851702, 0.07426929, 0.91846335, 0.81246626],
[ 0.12850153, 0.23018432, 0.29871917, 0.71228445, 0.13235569,
0.41061044, 0.98215759, 0.90024149, 0.53385031, 0.92247963],
[ 0.87011361, 0.44218826, 0.01772344, 0.87317121, 0.52231467,
0.86476815, 0.25352192, 0.31709731, 0.38249743, 0.74694788],
[ 0.15262914, 0.49544573, 0.49644637, 0.07461977, 0.13706958,
0.18619633, 0.86163998, 0.03700352, 0.51173556, 0.40018845]])
score_idx = tf.where(predictions > 0.8)
scores = tf.SparseTensor(score_idx, tf.gather_nd(predictions, score_idx), dense_shape=tf.shape(predictions, out_type=tf.int64))
dense_scores = tf.sparse_tensor_to_dense(scores)
print(sess.run([scores, dense_scores]))
I can easily get a sparse tensor that has all of the predictions above 0.8, but ultimately I am looking to return two separate 1D tensors:
Predicted Indices = list of indexes above threshold (0.8 in example)
Scores = the scores for the corresponding examples
So for the first row which is:
[ 0.32957435, 0.82079124, 0.54503286, 0.51966476, 0.63359714,
0.92034972, 0.13774526, 0.45154464, 0.18284607, 0.14604568]
I am looking to return:
predicted_indices = [1,5]
scores = [0.821, 0.920]
Is there a simple solution that I am missing?
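One way to do it, sticking with the TF1-style session from the question (a sketch; the per-row masking and variable names are mine):
# (row, col) indices of all entries above the threshold
score_idx = tf.where(predictions > 0.8)
scores = tf.gather_nd(predictions, score_idx)

# restrict to the first row to reproduce the expected output
row0_mask = tf.equal(score_idx[:, 0], 0)
predicted_indices = tf.boolean_mask(score_idx[:, 1], row0_mask)
row0_scores = tf.boolean_mask(scores, row0_mask)

print(sess.run([predicted_indices, row0_scores]))
# -> indices [1, 5] and scores [0.82079124, 0.92034972]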

How to find Inverse Cumulative Distribution Function of discrete functions in Python

I am trying to find the inverse CDF of a discrete probability distribution in Python and then plot it. My CDF is derived from the following numpy output:
array([ 0.228157, 0.440671, 0.588515, 0.683326, 0.740365, 0.783288,
0.81362 , 0.840518, 0.859213, 0.876764, 0.889355, 0.89813 ,
0.909194, 0.916443, 0.9256 , 0.930369, 0.938572, 0.942387,
0.946012, 0.951353, 0.954405, 0.956694, 0.965088, 0.966614,
0.96814 , 0.969475, 0.970047, 0.971001, 0.971573, 0.973099,
0.974816, 0.975388, 0.977105, 0.984163, 0.984354, 0.984736,
0.98569 , 0.985881, 0.986072, 0.986644, 0.990269, 0.990651,
0.990842, 0.993322, 0.993704, 0.994467, 0.995039, 0.995802,
0.996184, 0.996375, 0.996566, 0.996757, 0.997329, 0.99752 ,
0.997711, 0.997902, 0.998093, 0.998284, 0.998475, 0.998666,
0.998857, 0.999239, 0.999621, 0.999812, 1.00000])
I tried rv_discrete.ppf(q, *args, **kwds), but it works for random variables, which is not my case.
Since you have lots of points, perhaps linear interpolation between adjacent points is acceptable. First do a binary search to find the points adjacent to the probability you seek. Like this, with some tidying up:
import numpy as np

CDF = np.array([
    0.228157, 0.440671, 0.588515, 0.683326, 0.740365, 0.783288,
    0.81362 , 0.840518, 0.859213, 0.876764, 0.889355, 0.89813 ,
    0.909194, 0.916443, 0.9256  , 0.930369, 0.938572, 0.942387,
    0.946012, 0.951353, 0.954405, 0.956694, 0.965088, 0.966614,
    0.96814 , 0.969475, 0.970047, 0.971001, 0.971573, 0.973099,
    0.974816, 0.975388, 0.977105, 0.984163, 0.984354, 0.984736,
    0.98569 , 0.985881, 0.986072, 0.986644, 0.990269, 0.990651,
    0.990842, 0.993322, 0.993704, 0.994467, 0.995039, 0.995802,
    0.996184, 0.996375, 0.996566, 0.996757, 0.997329, 0.99752 ,
    0.997711, 0.997902, 0.998093, 0.998284, 0.998475, 0.998666,
    0.998857, 0.999239, 0.999621, 0.999812, 1.00000
])
# inverse of 0.3
index = np.searchsorted(CDF, 0.3)
print(index)
print((0.3 - CDF[index - 1]) / (CDF[index] - CDF[index - 1]))
The output is:
1
0.338062433534
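Wrapped into a reusable function, and assuming the support of the distribution is the indices 0..64 (the question doesn't say what the underlying values are), the interpolated inverse looks like this sketch:
def inverse_cdf(p, cdf=CDF):
    # index of the first CDF value >= p
    i = np.searchsorted(cdf, p)
    if i == 0:
        return 0.0
    lo, hi = cdf[i - 1], cdf[i]
    # linear interpolation between support points i-1 and i
    return (i - 1) + (p - lo) / (hi - lo)

print(inverse_cdf(0.3))   # 0.338..., i.e. between support points 0 and 1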

Python PCA - projection into lower dimensional space

I am trying to implement PCA. The intermediate results such as eigenvalues and eigenvectors look correct, yet when I try to project the (3-dimensional) data into a 2D principal-component space, the result is wrong.
I spent a lot of time comparing my code to other implementations such as:
http://sebastianraschka.com/Articles/2014_pca_step_by_step.html
Yet after a long time there is no progress and I cannot find the mistake. Since the intermediate results are correct, I assume the problem is a simple coding mistake.
Thanks in advance for anyone who actually read this question and thanks even more to those who give helpful comments/answers.
My code is as follows:
import numpy as np

class PCA():
    def __init__(self, X):
        # center the data
        X = X - X.mean(axis=0)
        # calculate covariance matrix based on X where data points are represented in rows
        C = np.cov(X, rowvar=False)
        # get eigenvectors and eigenvalues
        d, u = np.linalg.eigh(C)
        # sort both eigenvectors and eigenvalues descending by eigenvalue;
        # the output of np.linalg.eigh is sorted ascending, so both are reversed
        self.U = np.asarray(u).T[::-1]
        self.D = d[::-1]

    # problem starts here
    def project(self, X, m):
        # use the top m eigenvectors with the highest eigenvalues for the transformation matrix
        Z = np.dot(X, np.asmatrix(self.U[:m]).T)
        return Z
The result of my code is:
([[ 0.03463706, -2.65447128],
[-1.52656731, 0.20025725],
[-3.82672364, 0.88865609],
[ 2.22969475, 0.05126909],
[-1.56296316, -2.22932369],
[ 1.59059825, 0.63988429],
[ 0.62786254, -0.61449831],
[ 0.59657118, 0.51004927]])
The correct result, as produced e.g. by sklearn's PCA, is:
([[ 0.26424835, -2.25344912],
[-1.29695602, 0.60127941],
[-3.59711235, 1.28967825],
[ 2.45930604, 0.45229125],
[-1.33335186, -1.82830153],
[ 1.82020954, 1.04090645],
[ 0.85747383, -0.21347615],
[ 0.82618248, 0.91107143]])
The input is defined as follows:
X = np.array([
[-2.133268233289599,0.903819474847349,2.217823388231679,-0.444779660856219,-0.661480010318842,-0.163814281248453,-0.608167714051449, 0.949391996219125],
[-1.273486742804804,-1.270450725314960,-2.873297536940942, 1.819616794091556,-2.617784834189455, 1.706200163080549,0.196983250752276,0.501491995499840],
[-0.935406638147949,0.298594472836292,1.520579082270122,-1.390457671168661,-1.180253547776717,-0.194988736923602,-0.645052874385757,-1.400566775105519]]).T
You need to center your data by subtracting the mean before you project it onto the new basis:
mu = X.mean(0)
C = np.cov(X - mu, rowvar=False)
d, u = np.linalg.eigh(C)
U = u.T[::-1]
Z = np.dot(X - mu, U[:2].T)
print(Z)
# [[ 0.26424835 -2.25344912]
# [-1.29695602 0.60127941]
# [-3.59711235 1.28967825]
# [ 2.45930604 0.45229125]
# [-1.33335186 -1.82830153]
# [ 1.82020954 1.04090645]
# [ 0.85747383 -0.21347615]
# [ 0.82618248 0.91107143]]
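Folded back into the class from the question, the fix amounts to storing the training mean and reusing it in project (a sketch along the lines of the answer above):
class PCA():
    def __init__(self, X):
        self.mu = X.mean(axis=0)              # remember the centering mean
        C = np.cov(X - self.mu, rowvar=False)
        d, u = np.linalg.eigh(C)
        self.U = u.T[::-1]                    # eigenvectors, descending eigenvalue order
        self.D = d[::-1]

    def project(self, X, m):
        # center with the stored mean before projecting onto the top m components
        return np.dot(X - self.mu, self.U[:m].T)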
