SymPy rank different from NumPy matrix rank - python

Given some SymPy matrix M
M = Matrix([
[0.000111334436666596, 0.00114870370895408, -0.000328330524152990, 5.61388353859808e-6, -0.000464532588930332, -0.000969955779635878, 1.70579589853818e-5, -5.77891177019884e-6, -0.000186812539472235, -2.37115911398055e-5],
[-0.00105346453420510, 0.000165063406707273, -0.00184449574409890, 0.000658080565333929, 0.00197652092300241, 0.000516180213512589, 9.53823860082390e-5, 0.000189858427211978, -3.80494288487685e-5, 0.000188984043643408],
[-0.00102465075104153, -0.000402915220398109, 0.00123785300884241, -0.00125808154543978, 0.000126618511490838, 0.00185985865307693, 0.000123626008509804, 0.000211557638637554, 0.000407232404255796, 1.89851719447102e-5],
[0.230813497584639, -0.209574389008468, 0.742275067362657, -0.202368828927654, -0.236683258718819, 0.183258819107153, 0.180335891933511, -0.530606389541138, -0.379368598768419, 0.334800403899511],
[-0.00102465075104153, -0.000402915220398109, 0.00123785300884241, -0.00125808154543978, 0.000126618511490838, 0.00185985865307693, 0.000123626008509804, 0.000211557638637554, 0.000407232404255796, 1.89851719447102e-5],
[0.00105346453420510, -0.000165063406707273, 0.00184449574409890, -0.000658080565333929, -0.00197652092300241, -0.000516180213512589, -9.53823860082390e-5, -0.000189858427211978, 3.80494288487685e-5, -0.000188984043643408],
[0.945967255845168, -0.0468645728473480, 0.165423896937049, -0.893045423193559, -0.519428986944650, -0.0463256408085840, -0.0257001217930424, 0.0757328764368606, 0.0541336731317414, -0.0477734271777646],
[-0.0273371493900004, -0.954100482348723, -0.0879282784854250, 0.100704543595514, -0.243312734473589, -0.0217088779350294, 0.900584332231093, 0.616061129532614, 0.0651163853434486, -0.0396603397583054],
[0.0967584768347089, -0.0877680087304911, -0.667679934757176, -0.0848411039101494, -0.0224646387789634, -0.194501966574153, 0.0755161040544943, 0.699388977592066, 0.394125039254254, -0.342798611994521],
[-0.000222668873333193, -0.00229740741790816, 0.000656661048305981, -1.12277670771962e-5, 0.000929065177860663, 0.00193991155927176, -3.41159179707635e-5, 1.15578235403977e-5, 0.000373625078944470, 4.74231822796110e-5]
])
I have calculated the SymPy rank() and rref() of the matrix. The rank is 7 and the rref() result is:
Matrix([
[1, 0, 0, 0, 0, 0, 0, -5.14556976678473, -3.72094268951566, 3.48581267477014],
[0, 1, 0, 0, 0, 0, 0, -5.52930150663022, -4.02230308325653, 3.79193678096199],
[0, 0, 1, 0, 0, 0, 0, 2.44893308665325, 1.83777402439421, -1.87489784909824],
[0, 0, 0, 1, 0, 0, 0, -7.33732284392352, -5.25036238623229, 4.97256759287563],
[0, 0, 0, 0, 1, 0, 0, 5.48049237370489, 3.90091366576548, -3.83642187384021],
[0, 0, 0, 0, 0, 1, 0, -10.6826798792866, -7.56560803870182, 7.45974067056387],
[0, 0, 0, 0, 0, 0, 1, -3.04726210012149, -2.66388837034592, 2.48327234504403],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
The weird thing is that if I calculate the rank with either NumPy or MATLAB I get 6, and calculating the rref with MATLAB gives the expected result: the last 4 rows are all zero (instead of only the last 3).
Does anyone know where this difference comes from, and why I am unable to get correct results with SymPy? I know that rank 6 is correct because this is a system of equations in which some linear dependencies exist.

Looking at the eigenvalues of your matrix, the rank is indeed 6:
array([ 1.14550481e+00+0.00000000e+00j, -1.82137718e-01+6.83443168e-01j,
-1.82137718e-01-6.83443168e-01j, 2.76223053e-03+0.00000000e+00j,
-3.51138883e-04+8.61508469e-04j, -3.51138883e-04-8.61508469e-04j,
5.21160131e-17+0.00000000e+00j, -2.65160469e-16+0.00000000e+00j,
-2.67753616e-18+9.70937977e-18j, -2.67753616e-18-9.70937977e-18j])
With the SymPy version I have, I even get a rank of 8, compared to the rank of 6 that NumPy returns.
But SymPy actually cannot solve for the eigenvalues of this matrix due to its size (probably related to the question "SymPy could not compute the eigenvalues of this matrix").
So one of them, SymPy, tries to solve the equations symbolically and derive the rank from imperfect floating-point numbers, whereas the other, NumPy, uses numerical approximations (LAPACK, IIRC) to find the eigenvalues. With an adequate threshold, NumPy finds the proper rank, though it could have answered differently with a different threshold. SymPy, working on a floating-point approximation of a perfectly rank-6 system, finds that it is of rank 7 or 8. That is not surprising given the floating-point differences (SymPy converts the floats to rationals, for instance, instead of staying in the floating-point realm).
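On the practical side, you can make SymPy behave like NumPy by passing rank()/rref() an iszerofunc with an explicit tolerance, which plays the same role as the SVD threshold inside numpy.linalg.matrix_rank. A minimal sketch with an illustrative 2x2 matrix (the 1e-10 tolerance is an assumption you would tune for your data):

```python
import numpy as np
from sympy import Matrix

# rank 1 up to floating-point noise: row 2 is 2 * row 1 plus a tiny perturbation
A = [[1.0, 2.0],
     [2.0, 4.0 + 1e-15]]

np_rank = np.linalg.matrix_rank(np.array(A))   # SVD with a relative threshold
sym_rank = Matrix(A).rank()                    # exact zero test on noisy floats
tol_rank = Matrix(A).rank(iszerofunc=lambda v: abs(v) < 1e-10)

print(np_rank, sym_rank, tol_rank)   # 1 2 1
```

With the exact zero test, the residual left by the perturbation counts as a pivot and inflates the rank; the tolerant zero test discards it, matching NumPy.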

Related

Sampling from exponential Bernoulli

Bernoulli is a probability distribution. I need to sample from an exponential Bernoulli and return a binary value (i.e. either 0 or 1). I found this algorithm for exponential Bernoulli sampling
and I want to implement it, but I do not understand step 3 of the algorithm, where:
r1 = r1 & (2^h - 1)
Could someone help?
You can use a library which implements sampling from a Bernoulli distribution, e.g., np.random.binomial (as the binomial distribution with n = 1 is the Bernoulli distribution).
import numpy as np
np.random.binomial(n=1, p=.2, size=20)
# output: array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0])
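As for step 3 of the linked algorithm: r1 & (2^h - 1) is simply a bitmask. It keeps only the lowest h bits of r1, which is the same as reducing r1 modulo 2^h (the ^ in the pseudocode is exponentiation, not XOR). A small illustration (the values of h and r1 are arbitrary):

```python
# In the algorithm's step 3, '^' means power, so 2^h - 1 is a run of h one-bits.
h = 4
mask = (1 << h) - 1      # 2**h - 1 == 0b1111 for h = 4
r1 = 0b10110110          # an arbitrary example value (182)
low_bits = r1 & mask     # keeps only the h low-order bits of r1
print(low_bits)          # 6, the same as r1 % (2 ** h)
```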

Ethernet/IP device: able to read attributes using CPPPO, but cannot write

I am working to build a Python script to communicate with an EtherNet/IP device (Graco PD2K spray system). The only documentation provided by the vendor is how to configure an Allen Bradley PLC as the client to communicate with the device.
Using the following code, I can read the array of attributes at Assembly Instance 100:
from cpppo.server.enip.get_attribute import proxy_simple
via = proxy_simple('192.168.10.5')
with via:
    data, = via.read([('#4/100/3', 'DINT')])
... which results in receiving back the expected array:
[0, 0, 0, 0, 0, 0, 0, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 10, 10, 10, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
(39 x 32-bit integers)
When attempting to write to the attributes at Assembly Instance 150, I receive True back from the controller, but the controller does not update the parameters. It is expecting a 25 x 32-bit integer array:
with via:
    result, = via.read([('#4/150/3=(DINT)4, 1, 0, 0, 1, 0, 0, 0, 1, 0, 10, 10, 10, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0', '#4/150/3')], 1)
The output from above is:
#4/150/3=(DINT)4, 1, 0, 0, 1, 0, 0, 0, 1, 0, 10, 10, 10, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 == True
If I add one integer to the array (or subtract one, or attempt to set other than #4/150/3), I get back None, so it is clear that I am close on the format and that the command is getting through.
I have reached out to the vendor multiple times, and they insist it is a problem with Python (or, more specifically, they do not support Python and recommend integrating with a PLC).
I am wondering if the "Configuration" parameter at Assembly Instance 1 is the issue (see image above). I have tried multiple versions of the following code to try to write that parameter. Not fully understanding the EtherNet/IP protocol, I'm not even sure what that particular instance does -- however, that it is a parameter in an Allen-Bradley config indicates it is important in this case.
Code attempted:
result, = via.read([('#4/1/3=(USINT)0','#4/1/3')],1)
I have tried using the Molex EnIP utility, as well as something similar on SourceForge to take Python out of the equation, but results are similar. I have also tried the PyComm3 module, but I can't even get it to return Id information. I have also tried using -vvv with the CPPPO command line utils:
python -m cpppo.server.enip.get_attribute -a 192.168.10.5 '#4/150/3=(DINT)4, 1, 0, 0, 1, 0, 0, 0, 1, 0, 10, 10, 10, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0' -vv -S
Results in (along with much more output that I don't believe is relevant):
11-09 12:11:18.119 MainThread enip.cli DETAIL issue Sending 1 (Context b'0')
11-09 12:11:18.120 MainThread enip.cli DETAIL pipeline Issuing 0/ 1; curr: 0 - last: -1 == 1 depth vs. max 0
11-09 12:11:18.124 MainThread enip.cli DETAIL __next__ Client CIP Rcvd: {
"send_data.interface": 0,
"send_data.timeout": 8,
"send_data.CPF.count": 2,
"send_data.CPF.item[0].type_id": 0,
"send_data.CPF.item[0].length": 0,
"send_data.CPF.item[1].type_id": 178,
"send_data.CPF.item[1].length": 4,
"send_data.CPF.item[1].unconnected_send.request.input": "array('B', [144, 0, 0, 0])",
"send_data.CPF.item[1].unconnected_send.request.service": 144,
"send_data.CPF.item[1].unconnected_send.request.status": 0,
"send_data.CPF.item[1].unconnected_send.request.status_ext.size": 0,
"send_data.CPF.item[1].unconnected_send.request.set_attribute_single": true
}
11-09 12:11:18.124 MainThread enip.cli DETAIL collect Receive 1 (Context b'0')
11-09 12:11:18.124 MainThread enip.cli DETAIL pipeline Completed 1/ 1; curr: 0 - last: 0 == 0 depth vs. max 0
Mon Nov 9 12:11:18 2020: 0: Single S_A_S #0x0004/150/3 == True
11-09 12:11:18.124 MainThread enip.cli DETAIL pipeline Pipelined 1/ 1; curr: 0 - last: 0 == 0 depth vs. max 0
11-09 12:11:18.125 MainThread enip.get NORMAL main 1 requests in 0.006s at pipeline depth 0; 153.919 TPS
Again, result of the request is True, but the controller does not update any of the parameters.
I'm not sure what to try next...
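One more thing worth checking is the exact byte layout of the data being written. As a sketch, the 25-element DINT payload can be packed explicitly and size-checked before sending; the commented pycomm3 call is an untested assumption based on its generic_message API, with the class/instance/attribute mirroring the cpppo path above:

```python
import struct

# the 25 x 32-bit values from the cpppo write attempt above
values = [4, 1, 0, 0, 1, 0, 0, 0, 1, 0, 10, 10, 10] + [0] * 12
payload = struct.pack('<25i', *values)   # 25 little-endian signed 32-bit ints (CIP is little-endian)
print(len(payload))                      # 100 bytes = 25 * 4

# hypothetical pycomm3 equivalent of the Set_Attribute_Single (requires a live device):
# from pycomm3 import CIPDriver, Services
# with CIPDriver('192.168.10.5') as drv:
#     drv.generic_message(service=Services.set_attribute_single,
#                         class_code=4, instance=150, attribute=3,
#                         request_data=payload)
```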

How to solve these two equations using sympy?

I cannot solve these two equations using SymPy:
from sympy import symbols, solve
x, y = symbols('x y')
eq1 = 20*x*y - 10*x - 4*x**3
eq2 = 10*x**2 - 8*y - 8*y**3
solve([eq1, eq2], [x, y])
My answer is (0, 0), (0, -i), (0, i), but the book's answer is (0, 0), (±2.64, 1.90), (±0.86, 0.65).
The book is Calculus, 6th edition, James Stewart (section 15.7).
I'm not sure why solve isn't working here. These equations are cubic in each of x and y, and since they are non-degenerate, I think up to 9 roots are possible. I can find them by solving eq1 for x, eliminating x from eq2, and then solving eq2 for y. In one line, that is:
In [101]: sols = [{x: xi.subs(y, yi), y:yi} for xi in solve(eq1, x) for yi in solve(eq2.subs(x, xi), y)]
The full expression for some of these roots is complicated since they come from the cubic formula. I'll show approximate numerical values instead:
In [102]: for s in sols: print('(%s, %s)' % (s[x].n(3, chop=True), s[y].n(3, chop=True)))
(0, 0)
(0, -1.0*I)
(0, 1.0*I)
(-0.857, 0.647)
(-2.64, 1.90)
(-3.9*I, -2.54)
(0.857, 0.647)
(2.64, 1.90)
(3.9*I, -2.54)
Checking these roots with simplify or checksol fails for the (±3.9*I, -2.54) roots, so I'll demonstrate numerically that they are probably solutions instead:
In [103]: [eq1.evalf(subs=s, chop=True) for s in sols]
Out[103]: [0, 0, 0, 0, 0, 0, 0, 0, 0]
In [104]: [eq2.evalf(subs=s, chop=True) for s in sols]
Out[104]: [0, 0, 0, 0, 0, 0, 0, 0, 0]
The procedure recommended by Oscar can be done automatically by using the "manual=True, check=False" flags:
>>> sol = solve((eq1,eq2), check=False, manual=True)
>>> [eq1.subs(s).n(2,chop=True) for s in sol]
[0, 0, 0, 0, 0, 0, 0, 0, 0]
>>> [eq2.subs(s).n(2,chop=True) for s in sol]
[0, 0, 0, 0, 0, 0, 0, 0, 0]
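If only the real roots reported by the book are needed, a lighter-weight alternative is nsolve with a starting guess near each expected root (the guesses below are taken from the book's answers):

```python
from sympy import symbols, nsolve

x, y = symbols('x y')
eq1 = 20*x*y - 10*x - 4*x**3
eq2 = 10*x**2 - 8*y - 8*y**3

# start near the book's (2.64, 1.90); by the symmetry of eq1/eq2, (-x, y) is also a root
root = nsolve((eq1, eq2), (x, y), (2.6, 1.9))
print(root.T)   # approximately [2.64, 1.90]
```

Repeating this with a guess near (0.86, 0.65) recovers the other real pair.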

SKLearn - Unusually high performance with Random Forest using a single feature

I am using Random Forest as a binary classifier for a dataset and the results just don't seem believable, but I can't find where the problem is.
The problem lies in the fact that the examples are clearly not separable by setting a threshold, as the values for the feature of interest for the positive/negative examples are highly homogeneous. When only a single feature is used for binary classification, RF should only be able to discriminate between examples by setting an absolute threshold for positive/negative identification, right? If that's the case, how can the code below result in perfect performance on the test set?
P.S. In practice I have many more than the ~30 examples shown below, but only included these as an example. Same performance when evaluating >100.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
X_train = np.array([0.427948, 0.165065, 0.31179, 0.645415, 0.125764,
0.448908, 0.417467, 0.524891, 0.038428, 0.441921,
0.927511, 0.556332, 0.243668, 0.565939, 0.265502,
0.122271, 0.275983, 0.60786, 0.670742, 0.565939,
0.117031, 0.117031, 0.001747, 0.148472, 0.038428,
0.50393, 0.49607, 0.148472, 0.275983, 0.191266,
0.254148, 0.430568, 0.198253, 0.323144, 0.29869,
0.344978, 0.524891, 0.323144, 0.344978, 0.28821,
0.441921, 0.127511, 0.31179, 0.254148, 0, 0.001747,
0.243668, 0.281223, 0.281223, 0.427948, 0.548472,
0.927511, 0.417467, 0.282969, 0.367686, 0.198253,
0.572926, 0.29869, 0.570306, 0.183406, 0.310044,
1, 1, 0.60786, 0, 0.282969, 0.349345, 0.521106,
0.430568, 0.127511, 0.50393, 0.367686, 0.310044,
0.556332, 0.670742, 0.30393, 0.548472, 0.193886,
0.349345, 0.122271, 0.193886, 0.265502, 0.537991,
0.165065, 0.191266])
y_train = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0,
0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1,
1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0,
1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0,
1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1,
0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1,
0, 0, 1, 0, 0, 0, 0])
X_test = np.array((0.572926, 0.521106, 0.49607, 0.570306, 0.645415,
0.125764, 0.448908, 0.30393, 0.183406, 0.537991))
y_test = np.array((1, 1, 1, 0, 0, 0, 1, 1, 0, 0))
# Instantiate model and set parameters
clf = RandomForestClassifier()
clf.set_params(n_estimators=500, criterion='gini', max_features='sqrt')
# Note: reshape is because sklearn requires column-vector format,
# but a default 1-D NumPy array is a row
clf.fit(X_train.reshape(-1, 1), y_train)
pred = clf.predict(X_test.reshape(-1, 1))
# sort by feature value for comparison
o = np.argsort(X_test)
print('Example#\tX\t\t\tY_test\tY_pred')
for i in o:
    print('%d\t\t\t%f\t%d\t%d' % (i, X_test[i], y_test[i], pred[i]))
Which then returns:
Example# X Y_test Y_pred
5 0.125764 0 0
8 0.183406 0 0
7 0.303930 1 1
6 0.448908 1 1
2 0.496070 1 1
1 0.521106 1 1
9 0.537991 0 0
3 0.570306 0 0
0 0.572926 1 1
4 0.645415 0 0
How can an RF model with a single feature possibly discriminate these examples? Isn't there something wrong? I've looked into the configuration of the classifier and whatnot and can't find any problems. I was thinking that maybe it was a problem of overfitting (however, I'm doing 10-fold cross validation, so that seems less likely), but then I came across this quote on the official Random Forests webpage: "Random forests does not overfit. You can run as many trees as you want." (https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#remarks)
When only a single feature is used for binary classification, RF should only be able to discriminate between examples by setting an absolute threshold for positive/negative identification, right?
Each branch can discriminate only by one threshold, but each tree is built up from several branches. If the X-space can be split into several intervals such that each interval has the same y-value, then as long as the classifier has enough data to learn the boundaries of those intervals, it will be able to predict the test set. However, I noticed that your "test" set seems to be a subset of your training set, which defeats the purpose of having a test set: of course, if you test on data that you trained on, the accuracy will be high. Try sorting your data by X-value, then taking X-values that aren't in your training set but lie between two adjacent X_train values that have different y-values, for instance x = .001. You should see accuracy plummet.
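The interval argument above can be shown with a minimal sketch (synthetic 1-D data, not the asker's): a single tree with unlimited depth memorizes interleaved labels by stacking several thresholds on the same feature.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# labels alternate along the single feature, so no single threshold separates them
X = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6]).reshape(-1, 1)
y = np.array([0, 1, 0, 1, 0, 1])

tree = DecisionTreeClassifier().fit(X, y)
print(tree.score(X, y))   # 1.0: perfect fit on the training points
print(tree.get_depth())   # > 1: several splits on the one feature were needed
```

A forest of such trees behaves the same way, which is why evaluating on points drawn from the training set gives deceptively perfect accuracy.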

Passing array arguments to my own 2D function applied on Pandas groupby

I am given the following pandas dataframe
df
long lat weekday hour
dttm
2015-07-03 00:00:38 1.114318 0.709553 6 0
2015-08-04 00:19:18 0.797157 0.086720 3 0
2015-08-04 00:19:46 0.797157 0.086720 3 0
2015-08-04 13:24:02 0.786688 0.059632 3 13
2015-08-04 13:24:34 0.786688 0.059632 3 13
2015-08-04 18:46:36 0.859795 0.330385 3 18
2015-08-04 18:47:02 0.859795 0.330385 3 18
2015-08-04 19:46:41 0.755008 0.041488 3 19
2015-08-04 19:47:45 0.755008 0.041488 3 19
I also have a function that receives as input 2 arrays:
import pandas as pd
import numpy as np
def time_hist(weekday, hour):
    # range, not xrange, under Python 3
    hist_2d = np.histogram2d(weekday, hour, bins=[range(0, 8), range(0, 25)])
    return hist_2d[0].astype(int)
I wish to apply my 2D function to each and every group of the following groupby:
df.groupby(['long', 'lat'])
I tried passing *args to .apply():
df.groupby(['long', 'lat']).apply(time_hist, [df.weekday, df.hour])
but I get an error: "The dimension of bins must be equal to the dimension of the sample x."
Of course the dimensions mismatch. The whole idea is that I don't know in advance which mini [weekday, hour] arrays to send to each and every group.
How do I do that?
Do:
import pandas as pd
import numpy as np
df = pd.read_csv('file.csv', index_col=0)
def time_hist(x):
    hour = x.hour
    weekday = x.weekday
    # range, not xrange, under Python 3
    hist_2d = np.histogram2d(weekday, hour, bins=[range(0, 8), range(0, 25)])
    return hist_2d[0].astype(int)
print(df.groupby(['long', 'lat']).apply(time_hist))
Output:
long lat
0.755008 0.041488 [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
0.786688 0.059632 [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
0.797157 0.086720 [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
0.859795 0.330385 [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
1.114318 0.709553 [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
dtype: object
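For a self-contained run without file.csv, the same groupby can be reproduced from a few of the sample rows above (the timestamps are not needed); each group yields a 7 x 24 weekday-by-hour count matrix:

```python
import numpy as np
import pandas as pd

# a subset of the sample rows above, rebuilt by hand
df = pd.DataFrame({
    'long':    [1.114318, 0.797157, 0.797157, 0.786688, 0.786688],
    'lat':     [0.709553, 0.086720, 0.086720, 0.059632, 0.059632],
    'weekday': [6, 3, 3, 3, 3],
    'hour':    [0, 0, 0, 13, 13],
})

def time_hist(x):
    # bin edges 0..7 and 0..24 give a 7 (weekday) x 24 (hour) count matrix
    hist_2d = np.histogram2d(x.weekday, x.hour, bins=[range(0, 8), range(0, 25)])
    return hist_2d[0].astype(int)

hists = df.groupby(['long', 'lat']).apply(time_hist)
print(hists.iloc[0].shape)                 # (7, 24)
print(hists[(0.797157, 0.086720)][3, 0])   # 2 events on weekday 3 at hour 0
```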
