Multivariate Normal changes value of variable in PyMC - python

I might be doing something wrong but I can't figure out what it is. I'm trying to reproduce some results from a real estate dataset from Baton Rouge, LA. The original code is written in WinBUGS here. There are some minor differences between the dataset used in the link above and the one I'm using right now, but I don't think they are significant. This is the code:
import pymc as pm, pandas as pd, numpy as np
from scipy.spatial.distance import pdist, squareform
from numpy.linalg import inv

# Loading dataset
df = pd.read_table('http://pastebin.com/raw.php?i=41us4HVj', sep=' ')

# Setting priors
beta = pm.Normal('beta', 0.0, 0.1, size=3)
mu = pm.Lambda('mu', lambda b=beta:
               b[0] + b[1]*df['LivingArea']/1000.0 + b[2]*df['Age'])
tau = pm.Gamma('tau', 0.1, 0.1)
phi = pm.Uniform('phi', 0.1, 10)

# Building a distance matrix from the coordinates
A = squareform(pdist(np.array(zip(df['Latitude'], df['Longitude']))))

# Using the powered exponential to obtain a precision matrix
precision = pm.Lambda('exp', lambda u=A, tau=tau, phi=phi, kappa=1:
                      inv((1/tau)*np.exp(-(phi*u)**kappa)))
If I inspect the value of mu, I get this:
mu.value
Out[2]:
0 24.568272
1 2.909063
2 -2.778916
3 28.206696
4 -0.270571
5 -2.865153
6 14.158162
7 31.466438
8 44.681351
9 22.191397
10 -6.412350
11 11.709424
12 25.453254
13 24.366674
14 34.711048
...
55 24.625763
56 21.763089
57 65.108136
58 15.428714
59 20.992329
60 36.384037
61 16.730507
62 23.021763
63 54.887747
64 30.612696
65 52.685840
66 59.612372
67 18.822422
68 18.940658
69 72.678188
Length: 70, dtype: float64
However, after running MvNormal, the value of mu is changed:
w = pm.MvNormal('w', mu, precision)
mu.value
Out[4]:
0 -107.913779
1 -1.243466
2 8.283926
3 26.412651
4 1.806728
5 -1.300734
6 -80.657396
7 71.614343
8 -3.817774
9 -10.283683
10 -3.804962
11 8.639403
12 18.927553
13 -10.004095
14 -37.431770
...
55 88.612179
56 18.011459
57 -7.421157
58 7.974531
59 -3.697444
60 -17.520367
61 36.453531
62 -39.235745
63 -6.701737
64 68.672902
65 -44.040923
66 11.826075
67 -21.995198
68 -15.886362
69 4.653335
Length: 70, dtype: float64
By the way, this only happens to mu. The precision variable remains the same.
Did I make a mistake?
UPDATE:
Already filed an issue on GitHub. After further inspection, the culprit seems to be the pd.Series object used inside the mu variable. If I convert the Series to NumPy arrays or remove them, mu no longer changes after calling MvNormal.
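A minimal sketch of that workaround, assuming the same PyMC 2 API as above (the variable names living_area and age are mine):

# Hypothetical fix: pull plain NumPy arrays out of the DataFrame so that
# no pd.Series object is captured inside the deterministic.
living_area = df['LivingArea'].values
age = df['Age'].values
mu = pm.Lambda('mu', lambda b=beta:
               b[0] + b[1]*living_area/1000.0 + b[2]*age)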
Thanks!

Related

Numpy Vectorized Window Operations

I'm interested in figuring out how to do vectorized computations in a numpy array / pandas dataframe where each new cell is updated with local information.
For example, let's say I'm a weatherman interested in making predictions about the weather. My prediction algorithm will be the mean of the past 3 days. While this prediction is simple, I'd like to be able to do this with an arbitrary function.
Example data:
day temp
1 70
2 72
3 68
4 67
...
After a transformation should become
day temp prediction
1 70 None (no previous data)
2 72 70 (only one data point)
3 68 71 (two data points)
4 67 70
5 70 69
...
I'm only interested in the prediction column, so no need to make an attempt to join the data back together after achieving the prediction! Thanks!
Use rolling with a window of 3 and min_periods of 1; the trailing shift() moves each rolling mean down one row, so a day's prediction uses only earlier days:
df['prediction'] = df['temp'].rolling(window = 3, min_periods = 1).mean().shift()
df
day temp prediction
0 1 70 NaN
1 2 72 70
2 3 68 71
3 4 67 70
4 5 70 69
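Since the question asks for an arbitrary function, the same pattern generalizes with rolling(...).apply(), which accepts any callable that reduces a window to a scalar. A sketch using the example data (the mean again, so the output matches the table above):

import pandas as pd

df = pd.DataFrame({'day': [1, 2, 3, 4, 5],
                   'temp': [70, 72, 68, 67, 70]})

# Swap the lambda for any window-reducing function you like.
df['prediction'] = (df['temp']
                    .rolling(window=3, min_periods=1)
                    .apply(lambda w: w.mean())
                    .shift())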

Multiplying an entire df or matrix by 1000?

I am new to R and Python, so forgive me if this is an elementary question. I have a large data set of genes (columns) by patients (rows), with each value being an RNA expression value (most values falling between 0 and 1). I want to multiply the entire data set by 1000 so that all non-zero values will be >1.
Currently:
Pt GeneA GeneB GeneC
1 0.001 2 0
2 0 0.5 0.002
Would like:
Pt GeneA GeneB GeneC
1 1 2000 0
2 0 500 2
I have tried to do this in both R and Python and am running into issues with both. I have also tried converting my data between data frame and matrix, and it won't work with either. I have searched extensively on this website and find information about how to multiply an entire df/matrix by a vector, or individual columns by a scalar, but not the entire thing. Could someone kindly point me in the right direction? I feel like it can't possibly be this hard :)
Using R:
df <- read.csv("/Users/m/Desktop/data.csv")
df * 100
In Ops.factor(left, right) : ‘*’ not meaningful for factors
mtx <- as.matrix(df)
mtx * 100
Error in mtx * 100 : non-numeric argument to binary operator
Using Python 3.7.6:
df = df * 1000
^ This runs without an error message but the values in the cells are exactly the same, so it didn't actually multiply anything...
df = df.div(.001)
TypeError: unsupported operand type(s) for /: 'str' and 'float'
Any creative ideas or resources to point me in the right direction? Thank you!
What does str(df) give you? At least some of your columns have been converted to factors because they contain character strings. Open the csv file in a text editor and make sure the numbers are not surrounded by quotes and that missing values have not been labeled with a character. Once the data is read properly, the multiplication is simple:
set.seed(42)
dat <- data.frame(matrix(sample.int(100, 100, replace=TRUE), 10, 10))
str(dat)
# 'data.frame': 10 obs. of 10 variables:
# $ X1 : int 49 65 25 74 100 18 49 47 24 71
# $ X2 : int 100 89 37 20 26 3 41 89 27 36
# $ X3 : int 95 5 84 34 92 3 58 97 42 24
# $ X4 : int 30 43 15 22 58 8 36 68 86 18
# $ X5 : int 92 69 4 98 50 99 88 87 49 26
# $ X6 : int 6 6 2 3 21 2 58 10 40 5
# $ X7 : int 33 49 100 73 29 76 84 9 35 93
# $ X8 : int 16 92 69 92 2 82 24 18 69 55
# $ X9 : int 40 21 100 57 100 42 18 91 13 53
# $ X10: int 54 83 32 80 60 29 81 73 85 43
dat1000 <- dat * 1000
Try this option:
df[, c(2:ncol(df))] <- 1000*df[, c(2:ncol(df))]
If you instead wanted a perhaps more generic solution targeting only columns whose name starts with Gene, then use:
df[grep("^Gene", names(df))] <- 1000*df[grep("^Gene", names(df))]
Looking at your target result, you need to multiply all columns except Pt. In Python:
target_cols = [i for i in df.columns if i != 'Pt']
for i in target_cols:
    df[i] = df[i].astype(float)
    df[i] = df[i]*1000
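The Python failure has the same root cause as the R one diagnosed above: columns read in as strings. A sketch of one way to check and coerce in pandas (assuming df has been loaded as in the question):

import pandas as pd

print(df.dtypes)  # any 'object' column was parsed as strings

# Coerce everything except the patient ID to numeric, then scale;
# errors='coerce' turns unparseable cells into NaN instead of raising.
gene_cols = df.columns.drop('Pt')
df[gene_cols] = df[gene_cols].apply(pd.to_numeric, errors='coerce') * 1000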

How to create contours over points with Basemap?

I have a table "tempcc" of values with x, y geographic coordinates (I don't know how to attach files here; there are 86 rows in my csv):
X Y Temp
0 35.268 55.618 1.065389
1 35.230 55.682 1.119160
2 35.508 55.690 1.026214
3 35.482 55.652 1.007834
4 35.289 55.664 1.087598
5 35.239 55.655 1.099459
6 35.345 55.662 1.066117
7 35.402 55.649 1.035958
8 35.506 55.643 0.991939
9 35.526 55.688 1.018137
10 35.541 55.695 1.017870
11 35.471 55.682 1.033929
12 35.573 55.668 0.985559
13 35.547 55.651 0.982335
14 35.425 55.671 1.042975
15 35.505 55.675 1.016236
16 35.600 55.681 0.985532
17 35.458 55.717 1.063691
18 35.538 55.720 1.037523
19 35.230 55.726 1.146047
20 35.606 55.707 1.003364
21 35.582 55.700 1.006711
22 35.350 55.696 1.087173
23 35.309 55.677 1.088988
24 35.563 55.687 1.003785
25 35.510 55.764 1.079220
26 35.334 55.736 1.119026
27 35.429 55.745 1.093300
28 35.366 55.752 1.119061
29 35.501 55.745 1.068676
.. ... ... ...
56 35.472 55.800 1.117183
57 35.538 55.855 1.134721
58 35.507 55.834 1.129712
59 35.256 55.845 1.211969
60 35.338 55.823 1.174397
61 35.404 55.835 1.162387
62 35.460 55.826 1.138965
63 35.497 55.831 1.130774
64 35.469 55.844 1.148516
65 35.371 55.510 0.945187
66 35.378 55.545 0.969400
67 35.456 55.502 0.902285
68 35.429 55.517 0.925932
69 35.367 55.710 1.090652
70 35.431 55.490 0.903296
71 35.284 55.606 1.051335
72 35.234 55.634 1.088135
73 35.284 55.591 1.041181
74 35.354 55.587 1.010446
75 35.332 55.581 1.015004
76 35.356 55.606 1.023234
77 35.311 55.545 0.997468
78 35.307 55.575 1.020845
79 35.363 55.645 1.047831
80 35.401 55.628 1.021373
81 35.340 55.629 1.045491
82 35.440 55.643 1.017227
83 35.293 55.630 1.063910
84 35.370 55.623 1.029797
85 35.238 55.601 1.065699
I try to create isolines with:
import numpy as np
from numpy import meshgrid, linspace
from mpl_toolkits.basemap import Basemap

data = tempcc
m = Basemap(lat_0=np.mean(tempcc['Y'].values),
            lon_0=np.mean(tempcc['X'].values),
            llcrnrlon=35, llcrnrlat=55.3,
            urcrnrlon=35.9, urcrnrlat=56.0, resolution='l')
x = linspace(m.llcrnrlon, m.urcrnrlon, data.shape[1])
y = linspace(m.llcrnrlat, m.urcrnrlat, data.shape[0])
xx, yy = meshgrid(x, y)
m.contour(xx, yy, data, latlon=True)
#pt.legend()
m.scatter(tempcc['X'].values, tempcc['Y'].values, latlon=True)
#m.contour(x,y,data,latlon=True)
But I can't get this to work correctly, although everything seems fine. As far as I understand, I have to make a 2D matrix of values, where i is lat and j is lon, but I can't find an example.
The result I get:
As you can see, the region is correct, but the interpolation is not good.
What's the matter? Which parameter have I forgotten?
You could use a Triangulation and then call tricontour() instead of contour()
import matplotlib.pyplot as plt
from matplotlib.tri import Triangulation
from mpl_toolkits.basemap import Basemap
import numpy as np

m = Basemap(lat_0=np.mean(tempcc['Y'].values),
            lon_0=np.mean(tempcc['X'].values),
            llcrnrlon=35, llcrnrlat=55.3,
            urcrnrlon=35.9, urcrnrlat=56.0, resolution='l')
triMesh = Triangulation(tempcc['X'].values, tempcc['Y'].values)
tctr = m.tricontour(triMesh, tempcc['Temp'].values,
                    levels=np.linspace(min(tempcc['Temp'].values),
                                       max(tempcc['Temp'].values), 7),
                    latlon=True)
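To inspect the result, one might add a colorbar for the returned ContourSet (a small usage sketch, reusing tctr from above):

plt.colorbar(tctr, label='Temp')
plt.show()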

TypeError: '<' not supported between instances of 'str' and 'int' while doing PCA for k-means clustering

I am trying to apply Kernel Principal Component Analysis to a dataset without a dependent variable in order to do a cluster analysis with k-means, so that I can learn how to do so. Here is a sample of my dataset (according to the scenario, this is a dataset of a shopping mall, and the mall wants to discover customer segments from the data below):
CustomerID Genre Age Annual Income (k$) Spending Score (1-100)
1 Male 19 15 39
2 Male 21 15 81
3 Female 20 16 6
4 Female 23 16 77
5 Female 31 17 40
6 Female 22 17 76
7 Female 35 18 6
8 Female 23 18 94
9 Male 64 19 3
10 Female 30 19 72
11 Male 67 19 14
First, I omitted the CustomerID column and then encoded the gender column to be able to apply kernel PCA. Here is how I did it:
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the mall dataset with pandas
dataset = pd.read_csv('Mall_Customers.csv')
X = dataset.iloc[:, 1:5].values
df = pd.DataFrame(X)
#df is in order to visualize the "X" on variable explorer
#Encoding independent categorical variables
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_X = LabelEncoder()
X[:, 0] = labelencoder_X.fit_transform(X[:, 0])
onehotencoder = OneHotEncoder(categorical_features = [0])
X = onehotencoder.fit_transform(X).toarray()
After executing this code, I could get the array with float64 Type. The sample from the array I created is below:
0 1 19 15 39
0 1 21 15 81
1 0 20 16 6
1 0 23 16 77
1 0 31 17 40
1 0 22 17 76
1 0 35 18 6
1 0 23 18 94
0 1 64 19 3
1 0 30 19 72
0 1 67 19 14
And then, I wanted to apply Kernel PCA to get the principal components which I will use at k-means. However, when I try to execute the code below, I get the error "TypeError: '<' not supported between instances of 'str' and 'int'".
# Applying Kernel PCA
from sklearn.decomposition import KernelPCA
kpca = KernelPCA(n_components = 'None', kernel = 'rbf')
X = kpca.fit_transform(X)
explained_variance = kpca.explained_variance_ratio_
Even if I encoded my categorical data and I don't have any strings in my dataset, I cannot understand why it gives this error. Is there anyone that could help?
Thank you very much in advance.
n_components = 'None' is the problem; you should not pass a string here. Use:
kpca = KernelPCA(n_components = None, kernel = 'rbf')
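For context, a minimal reproduction of why the string triggers exactly this error (a sketch; the precise internal check in scikit-learn may differ):

n_components = 'None'       # the string 'None', not the None object
n_samples = 200
# KernelPCA eventually compares the requested component count with an
# integer; comparing a str with an int raises the TypeError above.
n_components < n_samples    # TypeError: '<' not supported between instances of 'str' and 'int'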
I suspect this is what is happening: the error is raised not by your own lines but by library code that runs when you call fit_transform. The '<' in the TypeError comes from a comparison inside that code, which is being handed the string you passed instead of an int.

seek a better design suggestion for a trial-and-error mechanism in python?

See the data matrix below, obtained from sensors; just INT numbers, nothing special.
A B C D E F G H I J K
1 25 0 25 66 41 47 40 12 69 76 1
2 17 23 73 97 99 39 84 26 0 44 45
3 34 15 55 4 77 2 96 92 22 18 71
4 85 4 71 99 66 42 28 41 27 39 75
5 65 27 28 95 82 56 23 44 97 42 38
…
10 95 13 4 10 50 78 4 52 51 86 20
11 71 12 32 9 2 41 41 23 31 70
12 54 31 68 78 55 19 56 99 67 34 94
13 47 68 79 66 10 23 67 42 16 11 96
14 25 12 88 45 71 87 53 21 96 34 41
The horizontal A to K are the sensor names, and each row down the page is one set of readings from the sensors over time.
Now I want to analyze those data with trial-and-error methods. I defined some concepts to explain what I want:
o source
source is all the raw data I get
o entry
an entry is one set of readings from all sensors A to K; take the first row for example: the entry is
25 0 25 66 41 47 40 12 69 76 1
o rules
a rule is a "suppose" function that returns an assertion value, so far just "true" or "false".
For example, I suppose the values of sensors A, E and F will never be the same within one entry; if an entry has A=E=F, it will trigger a violation and the rule function will return false (see the sketch after this list).
o range
a range is a function for selecting entries, for example the first 5 entries
Then, the basic idea is:
o source + range = subsource(s)
o subsource + rules = violation(s)
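To make the rule concept concrete, a rule could be written as a small predicate over one entry (an illustrative sketch; the function name and dict-style entry are my assumptions):

def rule_a_e_f(entry):
    # False (a violation) when sensors A, E and F all read the same value.
    return not (entry['A'] == entry['E'] == entry['F'])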
Finally, I want to get a list that may look like this:
rangeID ruleID violation
1 1 Y
2 1 N
3 1 Y
1 2 N
2 2 N
3 2 Y
1 3 N
2 3 Y
3 3 Y
But the problem is that the rules and ranges defined here will get very complicated soon if you look deeper; they have too many possible combinations. Take "A=E=F" for example: one can also define "B=E=F", "C=E=F", "C>F" ......
So soon I will need a rule/range generator which accepts those "core parameters" such as "A=E=F" as input, perhaps even as regex strings later. That is too complicated and has defeated me, let alone the need to persist unique rule IDs, the data storage problem, the problem of rules nesting and combining ......
So my questions are:
Does anyone know of a module/software suited to this kind of trial-and-error calculation or the rule definitions I want?
Can anyone suggest a better rules/range design than the one I described?
Thanks for any hints.
Rgs,
KC
If I understand what you're asking correctly, I probably wouldn't even venture down the NumPy path, as I don't think, given your description, that it's really required. Here's a sample implementation of how I might go about solving the specific issue that you presented:
l = [
    {'a': 25, 'b': 0, 'c': 25, 'd': 66, 'e': 41, 'f': 47, 'g': 40, 'h': 12, 'i': 69, 'j': 76, 'k': 1},
    {'a': 25, 'b': 0, 'c': 25, 'd': 66, 'e': 41, 'f': 47, 'g': 40, 'h': 12, 'i': 69, 'j': 76, 'k': 1}
]
r = ['a=g=i', 'a=b', 'a=c']
res = []
# test all given rules
for n in range(0, len(r)):
    # equality is assumed here - you'd have to change this to accept other operators if needed
    c = r[n].split('=')
    vals = []
    # build up a list of values given our current rule
    for e in c:
        vals.append(l[0][e])
    # len(set(vals)) gives us the number of distinct values
    res.append({'rangeID': 0, 'ruleID': n, 'violation': 'Y' if len(set(vals)) == 1 else 'N'})
print(res)
Output:
[{'violation': 'N', 'ruleID': 0, 'rangeID': 0}, {'violation': 'N', 'ruleID': 1, 'rangeID': 0}, {'violation': 'Y', 'ruleID': 2, 'rangeID': 0}]
http://ideone.com/zbTZr
There are a few assumptions made here (such as equality being the only operator in use in your rules) and some functionality is left out (such as parsing your input into the list of dicts I used), but I'm hopeful that you can figure that out on your own.
Of course, there could be a Numpy-based solution that's simpler than this that I'm just not thinking of at the moment (it's late and I'm going to bed now ;)), but hopefully this helps you out anyway.
Edit:
Whoops, missed something else (forgot to add it in prior to posting): I only test the first element in l (the given range). You'd just want to wrap that in another for loop rather than using the hard-coded 0 index, as sketched below.
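A sketch of that edit, looping over every range (entry) as well as every rule; variable names follow the answer's code:

res = []
for range_id, entry in enumerate(l):
    for rule_id, rule in enumerate(r):
        keys = rule.split('=')          # still equality-only rules
        vals = [entry[k] for k in keys]
        res.append({'rangeID': range_id,
                    'ruleID': rule_id,
                    'violation': 'Y' if len(set(vals)) == 1 else 'N'})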
You may want to look at NumPy for matrix data structures; it exposes a host of functions for matrix manipulation.
As for the rule/range generator, I am afraid you will have to build your own domain-specific language to achieve that.
