I am running a ridge regression model over some bootstrapped resampled data; for the sake of this question, let's say two bootstrapped samples in a list of dataframes. However, when I iterate over the list of dataframes, I get only one output instead of two, one per dataframe in the list. I'm not sure what I am missing in my code.
Below are the sample datasets:
import pandas as pd
import numpy as np
# the resampled datasets
d1 = {'v1': [2.5, 4.5, 3.3, 4.0, 3.8, 2.5, 4.5, 3.3, 4.0, 3.8, 2.5, 4.5, 3.3, 4.0, 3.8, 2.5, 4.5, 3.3, 4.0, 3.8],
'v2': [3.5, 3.8, 2.5, 4.0, 4.0, 3.5, 3.8, 2.5, 4.0, 4.0, 3.5, 3.8, 2.5, 4.0, 4.0, 3.8, 3.89, 2.75, 4.5, 4.25],
'v3': [4.5, 3.8, 3.5, 4.2, 4.3, 1.5, 2.98, 3.5, 3.5, 4.5, 3.8, 3.89, 2.75, 4.5, 4.25, 3.55, 3.85, 2.98, 4.05, 4.50]}
df1 = pd.DataFrame(d1)
d2 = {'v1': [2.6, 4.0, 3.3, 4.0, 3.0, 2.5, 4.5, 3.3, 4.0, 3.8, 4.5, 3.8, 3.5, 4.2, 4.3, 4.25, 3.55, 3.85, 2.98, 4.05],
'v2': [3.8, 3.89, 2.75, 4.5, 4.25, 3.55, 3.85, 2.98, 4.05, 4.50, 3.5, 2.98, 3.5, 3.25, 4.25, 4.0, 4.0, 3.5, 3.8, 2.5],
'v3': [4.0, 3.85, 3.75, 4.0, 4.73, 3.5, 2.98, 3.5, 3.25, 4.25, 3.3, 4.0, 3.8, 2.5, 4.5, 3.3, 4.0, 3.8, 2.5, 4.5]}
df2 = pd.DataFrame(d2)
dflst = [df1, df2]
and the code I am running on them:
from sklearn.linear_model import Ridge
# function to run ridge regression
def ridgereg(data, ynum=1):
    y = np.asarray(data.iloc[:, 0:ynum])
    X = np.asarray(data.iloc[:, ynum:])
    model = Ridge(alpha=1.0).fit(X, y)
    return model.coef_
# iterate over list of dfs
for x in range(1, len(dflst)):
    resampled_model = {}
    resampled_model[x] = ridgereg(dflst[x], ynum=1)
    print(resampled_model)
In the for loop, you are creating a new dict at each iteration, throwing the previously built dict away. Note also that range(1, len(dflst)) starts at index 1, so dflst[0] is never processed, which is why you only ever see one output.
Try (using enumerate):
resampled_model = {}  # note that it is outside the loop
for i, df in enumerate(dflst, start=1):
    resampled_model[i] = ridgereg(df, ynum=1)
print(resampled_model)
# {1: array([[0.35603345, 0.1373456 ]]), 2: array([[ 0.08019198, -0.10895105]])}
Instead of the for loop, you can use a dict comprehension:
resampled_model = {i: ridgereg(df, ynum=1) for i, df in enumerate(dflst, start=1)}
I need to know what the output of the following Python psutil commands looks like on a computer with multiple CPUs (multiple CPU sockets):
import psutil
print(psutil.cpu_percent(interval=0.3, percpu=True))
print(psutil.sensors_temperatures(fahrenheit=False))
print(psutil.sensors_fans())
Note: the Python package psutil must be installed.
Note 2: the last two commands are not available on Windows; they should be run on Linux.
From the psutil documentation:
cpu_percent returns the same structure whether there is one CPU socket or several: a flat, one-dimensional list of per-logical-CPU values.
percpu=False returns a single float, such as 2.3
percpu=True returns a list[float], such as
for 1 CPU, 4 cores, 8 threads: [23.8, 5.0, 10.0, 5.0, 15.0, 5.0, 15.0, 23.8]
for 4 CPUs, 4x4 cores, 4x8 threads: [23.8, 5.0, 10.0, 5.0, 15.0, 5.0, 15.0, 23.8, 23.8, 5.0, 10.0, 5.0, 15.0, 5.0, 15.0, 23.8, 23.8, 5.0, 10.0, 5.0, 15.0, 5.0, 15.0, 23.8, 23.8, 5.0, 10.0, 5.0, 15.0, 5.0, 15.0, 23.8]
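psutil itself does not tell you which logical CPU belongs to which socket. If you need that grouping, one option on Linux is to read the topology from sysfs; the following is a minimal sketch under that assumption (the sysfs path and the contiguous cpu0..cpuN numbering are Linux kernel conventions, not psutil API):
import psutil

# Group per-logical-CPU percentages by physical socket (Linux-only sketch).
percents = psutil.cpu_percent(interval=0.3, percpu=True)
sockets = {}
for cpu_id, pct in enumerate(percents):
    # physical_package_id identifies the socket a logical CPU sits on
    path = f"/sys/devices/system/cpu/cpu{cpu_id}/topology/physical_package_id"
    with open(path) as f:
        package = int(f.read())
    sockets.setdefault(package, []).append(pct)
print(sockets)  # e.g. {0: [23.8, 5.0, ...], 1: [10.0, 15.0, ...]}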
sensors_temperatures returns a dict[str, list[namedtuple]] such as
{'acpitz' : [shwtemp(label='', current=47.0, high=103.0, critical=103.0)],
'asus' : [shwtemp(label='', current=47.0, high=None, critical=None)],
'coretemp': [shwtemp(label='Physical id 0', current=52.0, high=100.0, critical=100.0),
shwtemp(label='Core 0', current=45.0, high=100.0, critical=100.0),
shwtemp(label='Core 1', current=52.0, high=100.0, critical=100.0),
shwtemp(label='Core 2', current=45.0, high=100.0, critical=100.0),
shwtemp(label='Core 3', current=47.0, high=100.0, critical=100.0)]}
sensors_fans returns a dict[str, list[namedtuple]] such as
{'asus': [sfan(label='cpu_fan', current=3200)]}
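On a multi-socket machine, the coretemp driver usually exposes one package-level reading per socket, labelled 'Physical id 0', 'Physical id 1', and so on. Assuming that labelling convention holds on your machine, here is a short sketch that picks out one temperature per socket:
import psutil

# Print one package-level temperature per socket (Linux; assumes the
# coretemp driver's usual "Physical id N" labels for package sensors).
temps = psutil.sensors_temperatures(fahrenheit=False)
for entry in temps.get('coretemp', []):
    if entry.label.startswith('Physical id'):
        print(entry.label, entry.current)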
I have an issue with numpy.array. I'm trying to build an array of shape (73, 125) from my data, but when I convert the data I get something like this:
set arousal (73,) [list([3.0, 4.0, 4.0, 3.0, 5.0, 3.0, 2.0, 4.0, 2.0, 3.0, 3.0, 3.0, 3.0, 3.0, 4.0, 2.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 2.0, 2.0, 3.0, 4.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 4.0, 5.0, 3.0, 3.0, 1.0, 3.0, 3.0, 3.0, 3.0, 3.0, 4.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 2.0, 2.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 4.0, 4.0, 3.0, 3.0, 3.0, 3.0, 3.0, 5.0, 4.0, 4.0, 5.0, 3.0, 3.0, 3.0, 5.0, 2.0, 3.0, 2.0, 4.0, 3.0, 2.0, 3.0, 2.0, 3.0, 2.0, 2.0, 4.0, 3.0, 4.0, 5.0, 4.0, 3.0, 4.0, 4.0, 4.0, 3.0, 5.0, 3.0, 5.0, 2.0, 3.0, 3.0, 2.0, 3.0, 3.0, 3.0, 3.0, 4.0, 5.0, 5.0, 4.0, 2.0, 3.0, 3.0, 3.0, 3.0, 3.0, 2.0, 1.0, 2.0]) # etc...
Instead, I was expecting something like set arousal (73, 125).
This is my code:
# Before this I imported the packages, the relevant datasets and did some preprocessing to drop "bad" data
info_en = info_clean[info_clean['QESTN_LANGUAGE'] == 'ENG']
rating_en = rating_clean[rating_clean['LANGUAGE'] == 'ENG']
info_en_set = info_en.copy()
ratings_set = rating_en.copy()
lArousal = []
lValence = []
for case in case_list:
    set = ratings_set[ratings_set['CASE'] == case]
    lArousal.append(list(set.loc[:, ['AROUSAL_RATING']]['AROUSAL_RATING']))
    lValence.append(list(set.loc[:, ['VALENCE_RATING_RECODED']]['VALENCE_RATING_RECODED']))
arrArousal = np.asarray(lArousal)
arrValence = np.asarray(lValence)
print('set arousal',arrArousal.shape,arrArousal)
print('set valence',arrValence.shape,arrValence)
When I try to train my sklearn classifier I get the error "setting an array element with a sequence.", which I can understand, but I can't solve the underlying list issue.
Apparently, the for loop works for one dataset that I am testing but not for the other: in one case I correctly get the 2-D array, in the other I am stuck with this array of lists.
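For context, this is exactly what NumPy does when the inner lists have unequal lengths: instead of a 2-D float array it produces a 1-D object array of lists, which is also what later triggers the "setting an array element with a sequence" error. A minimal reproduction of both behaviors (note that recent NumPy versions require dtype=object to build the ragged case at all; older versions did it silently):
import numpy as np

# Equal-length rows stack into a proper 2-D array...
ok = np.asarray([[1.0, 2.0], [3.0, 4.0]])
print(ok.shape)  # (2, 2)

# ...but ragged rows collapse into a 1-D object array of lists.
ragged = np.asarray([[1.0, 2.0], [3.0, 4.0, 5.0]], dtype=object)
print(ragged.shape)  # (2,)

# A quick diagnostic for the data above: if this prints more than one
# number, some CASE values have a different number of ratings.
# print(set(map(len, lArousal)))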
scipy.stats.rv_continuous.fit allows you to fix parameters when fitting a distribution, but it depends on SciPy's choice of parametrization. For the gamma distribution it uses the k, theta (shape, scale) parametrization, so it would be easy to fit while holding theta constant, for example. I want to fit to a data set where I know the mean, but the observed mean might vary due to sampling error. This would be easy if SciPy used the parametrization with mu = k*theta in place of theta. Is there a way to make SciPy do this? And if not, is there another library that can?
Here's some example code with a data set that has an observed mean of 9.952, although I know the actual mean of the underlying distribution is 11:
from scipy.stats import gamma
observations = [17.6, 24.9, 3.9, 17.6, 11.8, 10.4, 4.1, 11.7, 5.7, 1.6,
8.6, 12.9, 5.7, 8.0, 7.4, 1.2, 11.3, 10.4, 1.0, 1.9,
6.0, 9.3, 13.3, 5.4, 9.1, 4.0, 12.8, 11.1, 23.1, 4.2,
7.9, 11.1, 10.0, 3.4, 27.8, 7.2, 14.9, 2.9, 5.5, 7.0,
3.9, 12.3, 10.6, 22.1, 5.0, 4.1, 21.3, 15.9, 34.5, 8.1,
19.6, 10.8, 13.4, 22.8, 27.6, 6.8, 5.9, 9.0, 7.1, 21.2,
1.0, 14.6, 16.9, 1.0, 6.5, 2.9, 7.1, 14.1, 15.2, 7.8,
9.0, 4.9, 2.1, 9.5, 5.6, 11.1, 7.7, 18.3, 3.8, 11.0,
4.2, 12.5, 8.4, 3.2, 4.0, 3.8, 2.0, 24.7, 24.6, 3.4,
4.3, 3.2, 7.6, 8.3, 14.5, 8.3, 8.4, 14.0, 1.0, 9.0]
shape, _, scale = gamma.fit(observations, floc=0)
print(shape*scale)
and this gives
9.952
but I would like a fit such that shape*scale = 11.0.
The fit method of the SciPy distributions provides the maximum likelihood estimate of the parameters. You are correct that it only provides for fitting the shape, location and scale. (Actually, you said shape and scale, but SciPy also includes a location parameter; this is sometimes called the three-parameter gamma distribution.)
For most of the distributions in SciPy, the fit method uses a numerical optimizer to minimize the negative log-likelihood, as defined in the nnlf method. Instead of using the fit method, you can do this yourself with just a couple of lines of code. This allows you to create an objective function with just one parameter, say the shape k, and within that function set theta = mean/k, where mean is the desired mean, and call gamma.nnlf to evaluate the negative log-likelihood. Here's one way you could do it:
import numpy as np
from scipy.stats import gamma
from scipy.optimize import fmin
def nll(k, mean, x):
    # Negative log-likelihood of the gamma distribution with loc=0 and the
    # scale tied to the shape so that the mean is fixed: theta = mean/k.
    return gamma.nnlf(np.array([k[0], 0, mean/k[0]]), x)
observations = [17.6, 24.9, 3.9, 17.6, 11.8, 10.4, 4.1, 11.7, 5.7, 1.6,
8.6, 12.9, 5.7, 8.0, 7.4, 1.2, 11.3, 10.4, 1.0, 1.9,
6.0, 9.3, 13.3, 5.4, 9.1, 4.0, 12.8, 11.1, 23.1, 4.2,
7.9, 11.1, 10.0, 3.4, 27.8, 7.2, 14.9, 2.9, 5.5, 7.0,
3.9, 12.3, 10.6, 22.1, 5.0, 4.1, 21.3, 15.9, 34.5, 8.1,
19.6, 10.8, 13.4, 22.8, 27.6, 6.8, 5.9, 9.0, 7.1, 21.2,
1.0, 14.6, 16.9, 1.0, 6.5, 2.9, 7.1, 14.1, 15.2, 7.8,
9.0, 4.9, 2.1, 9.5, 5.6, 11.1, 7.7, 18.3, 3.8, 11.0,
4.2, 12.5, 8.4, 3.2, 4.0, 3.8, 2.0, 24.7, 24.6, 3.4,
4.3, 3.2, 7.6, 8.3, 14.5, 8.3, 8.4, 14.0, 1.0, 9.0]
# This is the desired mean of the distribution.
mean = 11
# Initial guess for the shape parameter.
k0 = 3.0
opt = fmin(nll, k0, args=(mean, np.array(observations)),
xtol=1e-11, disp=False)
k_opt = opt[0]
theta_opt = mean / k_opt
print(f"k_opt: {k_opt:9.7f}")
print(f"theta_opt: {theta_opt:9.7f}")
This script prints
k_opt: 1.9712604
theta_opt: 5.5801861
Alternatively, one can modify the first-order conditions for the extremum of the log-likelihood shown on Wikipedia so that there is just one parameter, k. Then the condition for the extreme value can be implemented as a scalar equation whose root can be found with, say, scipy.optimize.fsolve. The following is a variation of the above script that uses this technique.
import numpy as np
from scipy.special import digamma
from scipy.optimize import fsolve
def first_order_eq(k, mean, x):
    # d/dk of the log-likelihood (divided by n) after eliminating theta
    # via theta = mean/k; the constrained MLE is the root of this equation.
    mean_logx = np.mean(np.log(x))
    return (np.log(k) - digamma(k) + mean_logx - np.mean(x)/mean
            - np.log(mean) + 1)
observations = [17.6, 24.9, 3.9, 17.6, 11.8, 10.4, 4.1, 11.7, 5.7, 1.6,
8.6, 12.9, 5.7, 8.0, 7.4, 1.2, 11.3, 10.4, 1.0, 1.9,
6.0, 9.3, 13.3, 5.4, 9.1, 4.0, 12.8, 11.1, 23.1, 4.2,
7.9, 11.1, 10.0, 3.4, 27.8, 7.2, 14.9, 2.9, 5.5, 7.0,
3.9, 12.3, 10.6, 22.1, 5.0, 4.1, 21.3, 15.9, 34.5, 8.1,
19.6, 10.8, 13.4, 22.8, 27.6, 6.8, 5.9, 9.0, 7.1, 21.2,
1.0, 14.6, 16.9, 1.0, 6.5, 2.9, 7.1, 14.1, 15.2, 7.8,
9.0, 4.9, 2.1, 9.5, 5.6, 11.1, 7.7, 18.3, 3.8, 11.0,
4.2, 12.5, 8.4, 3.2, 4.0, 3.8, 2.0, 24.7, 24.6, 3.4,
4.3, 3.2, 7.6, 8.3, 14.5, 8.3, 8.4, 14.0, 1.0, 9.0]
# This is the desired mean of the distribution.
mean = 11
# Initial guess for the shape parameter.
k0 = 3
sol = fsolve(first_order_eq, k0, args=(mean, observations),
xtol=1e-11)
k_opt = sol[0]
theta_opt = mean / k_opt
print(f"k_opt: {k_opt:9.7f}")
print(f"theta_opt: {theta_opt:9.7f}")
Output:
k_opt: 1.9712604
theta_opt: 5.5801861
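As a sanity check (my addition, not part of the original answer), you can compare the negative log-likelihood of the constrained fit with that of the unconstrained floc=0 fit using gamma.nnlf; the constrained value should be slightly larger, which is the cost of forcing the mean to be 11:
from scipy.stats import gamma

obs = np.asarray(observations)
shape, _, scale = gamma.fit(obs, floc=0)       # unconstrained fit
print(gamma.nnlf((shape, 0, scale), obs))      # smaller NLL
print(gamma.nnlf((k_opt, 0, theta_opt), obs))  # slightly larger NLL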
I have an example dictionary which looks like this:
critics = {'Lisa Rose': {'Lady in the Water': 2.5, 'Snakes on a Plane': 3.5,
'Just My Luck': 3.0, 'Superman Returns': 3.5, 'You, Me and Dupree': 2.5,
'The Night Listener': 3.0},
'Gene Seymour': {'Lady in the Water': 3.0, 'Snakes on a Plane': 3.5,
'Just My Luck': 1.5, 'Superman Returns': 5.0, 'The Night Listener': 3.0,
'You, Me and Dupree': 3.5},
'Michael Phillips': {'Lady in the Water': 2.5, 'Snakes on a Plane': 3.0,
'Superman Returns': 3.5, 'The Night Listener': 4.0},
'Claudia Puig': {'Snakes on a Plane': 3.5, 'Just My Luck': 3.0,
'The Night Listener': 4.5, 'Superman Returns': 4.0,
'You, Me and Dupree': 2.5},
'Mick LaSalle': {'Lady in the Water': 3.0, 'Snakes on a Plane': 4.0,
'Just My Luck': 2.0, 'Superman Returns': 3.0, 'The Night Listener': 3.0,
'You, Me and Dupree': 2.0},
'Jack Matthews': {'Lady in the Water': 3.0, 'Snakes on a Plane': 4.0,
'The Night Listener': 3.0, 'Superman Returns': 5.0, 'You, Me and Dupree': 3.5},
'Toby': {'Snakes on a Plane':4.5,'You, Me and Dupree':1.0,'Superman Returns':4.0}}
I use a function that computes the similarity of two people in the dictionary using the Pearson correlation coefficient; it looks like this:
from math import sqrt
def sim_pearson(prefs, p1, p2):
    # list of shared items
    si = {}
    for item in prefs[p1]:
        if item in prefs[p2]: si[item] = 1
    # find the number of elements
    n = len(si)
    # if they have no items in common, return 0
    if n == 0: return 0
    # add up all the preferences
    sum1 = sum([prefs[p1][it] for it in si])
    sum2 = sum([prefs[p2][it] for it in si])
    # sum the squares
    sum1Sq = sum([pow(prefs[p1][it], 2) for it in si])
    sum2Sq = sum([pow(prefs[p2][it], 2) for it in si])
    # sum the products
    pSum = sum([prefs[p1][it]*prefs[p2][it] for it in si])
    # calculate the Pearson coefficient
    num = pSum - (sum1*sum2/n)
    den = sqrt((sum1Sq - pow(sum1, 2)/n)*(sum2Sq - pow(sum2, 2)/n))
    if den == 0: return 0
    r = num/den
    return r
and it works. For example, for the call print sim_pearson(critics, 'Toby', 'Lisa Rose') I get the coefficient 0.991240707162.
However, when I try the same function with my own dictionary, which is:
tests = {'dzam': {'ag1yYW5kb20tcmFuZG9tcg8LEghib29rbWFyaxiKAgw': 5.0,
'ag1yYW5kb20tcmFuZG9tcg8LEghib29rbWFyaxjvAQw': 1.0,
'ag1yYW5kb20tcmFuZG9tcg8LEghib29rbWFyaxj3AQw': 1.0,
'ag1yYW5kb20tcmFuZG9tcg8LEghib29rbWFyaxiMAgw': 5.0,
'ag1yYW5kb20tcmFuZG9tcg8LEghib29rbWFyaxiBAgw': 1.0,
'ag1yYW5kb20tcmFuZG9tcg8LEghib29rbWFyaxjtAQw': 1.0,
'ag1yYW5kb20tcmFuZG9tcg8LEghib29rbWFyaxj_AQw': 1.0,
'ag1yYW5kb20tcmFuZG9tcg8LEghib29rbWFyaxiIAgw': 5.0,
'ag1yYW5kb20tcmFuZG9tcg8LEghib29rbWFyaxj9AQw': 1.0,
'ag1yYW5kb20tcmFuZG9tcg8LEghib29rbWFyaxiqAgw': 3.0,
'ag1yYW5kb20tcmFuZG9tcg8LEghib29rbWFyaxjzAQw': 5.0,
'ag1yYW5kb20tcmFuZG9tcg8LEghib29rbWFyaxikAgw': 3.0,
'ag1yYW5kb20tcmFuZG9tcg8LEghib29rbWFyaxiaAgw': 5.0,
'ag1yYW5kb20tcmFuZG9tcg8LEghib29rbWFyaxj1AQw': 1.0,
'ag1yYW5kb20tcmFuZG9tcg8LEghib29rbWFyaxjxAQw': 5.0,
'ag1yYW5kb20tcmFuZG9tcg8LEghib29rbWFyaxiYAgw': 5.0},
'kex': {'ag1yYW5kb20tcmFuZG9tcg8LEghib29rbWFyaxiKAgw': 5.0,
'ag1yYW5kb20tcmFuZG9tcg8LEghib29rbWFyaxjvAQw': 1.0,
'ag1yYW5kb20tcmFuZG9tcg8LEghib29rbWFyaxj3AQw': 1.0,
'ag1yYW5kb20tcmFuZG9tcg8LEghib29rbWFyaxiMAgw': 5.0,
'ag1yYW5kb20tcmFuZG9tcg8LEghib29rbWFyaxiBAgw': 1.0,
'ag1yYW5kb20tcmFuZG9tcg8LEghib29rbWFyaxjtAQw': 1.0,
'ag1yYW5kb20tcmFuZG9tcg8LEghib29rbWFyaxj_AQw': 1.0,
'ag1yYW5kb20tcmFuZG9tcg8LEghib29rbWFyaxiIAgw': 5.0,
'ag1yYW5kb20tcmFuZG9tcg8LEghib29rbWFyaxj9AQw': 1.0,
'ag1yYW5kb20tcmFuZG9tcg8LEghib29rbWFyaxiqAgw': 3.0,
'ag1yYW5kb20tcmFuZG9tcg8LEghib29rbWFyaxjzAQw': 5.0,
'ag1yYW5kb20tcmFuZG9tcg8LEghib29rbWFyaxikAgw': 3.0,
'ag1yYW5kb20tcmFuZG9tcg8LEghib29rbWFyaxiaAgw': 5.0,
'ag1yYW5kb20tcmFuZG9tcg8LEghib29rbWFyaxj1AQw': 1.0,
'ag1yYW5kb20tcmFuZG9tcg8LEghib29rbWFyaxjxAQw': 5.0,
'ag1yYW5kb20tcmFuZG9tcg8LEghib29rbWFyaxiYAgw': 5.0},
'rokoko': {'ag1yYW5kb20tcmFuZG9tcg8LEghib29rbWFyaxiKAgw': 5.0,
'ag1yYW5kb20tcmFuZG9tcg8LEghib29rbWFyaxjvAQw': 1.0,
'ag1yYW5kb20tcmFuZG9tcg8LEghib29rbWFyaxj3AQw': 1.0,
'ag1yYW5kb20tcmFuZG9tcg8LEghib29rbWFyaxiMAgw': 5.0,
'ag1yYW5kb20tcmFuZG9tcg8LEghib29rbWFyaxiBAgw': 1.0,
'ag1yYW5kb20tcmFuZG9tcg8LEghib29rbWFyaxjtAQw': 1.0,
'ag1yYW5kb20tcmFuZG9tcg8LEghib29rbWFyaxj_AQw': 1.0,
'ag1yYW5kb20tcmFuZG9tcg8LEghib29rbWFyaxiIAgw': 5.0,
'ag1yYW5kb20tcmFuZG9tcg8LEghib29rbWFyaxj9AQw': 1.0,
'ag1yYW5kb20tcmFuZG9tcg8LEghib29rbWFyaxiqAgw': 3.0,
'ag1yYW5kb20tcmFuZG9tcg8LEghib29rbWFyaxjzAQw': 5.0,
'ag1yYW5kb20tcmFuZG9tcg8LEghib29rbWFyaxikAgw': 3.0,
'ag1yYW5kb20tcmFuZG9tcg8LEghib29rbWFyaxiaAgw': 5.0,
'ag1yYW5kb20tcmFuZG9tcg8LEghib29rbWFyaxj1AQw': 1.0,
'ag1yYW5kb20tcmFuZG9tcg8LEghib29rbWFyaxjxAQw': 5.0,
'ag1yYW5kb20tcmFuZG9tcg8LEghib29rbWFyaxiYAgw': 5.0},
'test#example.com': {'ag1yYW5kb20tcmFuZG9tcg8LEghib29rbWFyaxiKAgw': 5.0,
'ag1yYW5kb20tcmFuZG9tcg8LEghib29rbWFyaxjvAQw': 1.0,
'ag1yYW5kb20tcmFuZG9tcg8LEghib29rbWFyaxj3AQw': 1.0,
'ag1yYW5kb20tcmFuZG9tcg8LEghib29rbWFyaxiMAgw': 5.0,
'ag1yYW5kb20tcmFuZG9tcg8LEghib29rbWFyaxiBAgw': 1.0,
'ag1yYW5kb20tcmFuZG9tcg8LEghib29rbWFyaxjtAQw': 1.0,
'ag1yYW5kb20tcmFuZG9tcg8LEghib29rbWFyaxj_AQw': 1.0,
'ag1yYW5kb20tcmFuZG9tcg8LEghib29rbWFyaxiIAgw': 5.0,
'ag1yYW5kb20tcmFuZG9tcg8LEghib29rbWFyaxj9AQw': 1.0,
'ag1yYW5kb20tcmFuZG9tcg8LEghib29rbWFyaxiqAgw': 3.0,
'ag1yYW5kb20tcmFuZG9tcg8LEghib29rbWFyaxjzAQw': 5.0,
'ag1yYW5kb20tcmFuZG9tcg8LEghib29rbWFyaxikAgw': 3.0,
'ag1yYW5kb20tcmFuZG9tcg8LEghib29rbWFyaxiaAgw': 5.0,
'ag1yYW5kb20tcmFuZG9tcg8LEghib29rbWFyaxj1AQw': 1.0,
'ag1yYW5kb20tcmFuZG9tcg8LEghib29rbWFyaxjxAQw': 5.0,
'ag1yYW5kb20tcmFuZG9tcg8LEghib29rbWFyaxiYAgw': 5.0},
'seljak': {'ag1yYW5kb20tcmFuZG9tcg8LEghib29rbWFyaxiKAgw': 5.0,
'ag1yYW5kb20tcmFuZG9tcg8LEghib29rbWFyaxjvAQw': 1.0,
'ag1yYW5kb20tcmFuZG9tcg8LEghib29rbWFyaxiKAgw': 5.0,
'ag1yYW5kb20tcmFuZG9tcg8LEghib29rbWFyaxjvAQw': 1.0, }}
I always get 1.0, no matter what matches there are in the dictionaries. Why is that so?
By the way, I'm using hashes, so my dictionary MUST have these long strings. :)
You are probably being fooled by the long keys, which make it hard to see which strings actually differ.
Try setting all the values in the 'seljak' test to 0 and running a correlation against it; you'll see a 0 correlation:
print sim_pearson(tests, 'test#example.com', 'seljak')
Change the last value of the 'seljak' test to 1 and you will see a negative correlation when re-running the script.
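To see why 1.0 is actually the expected result here (a diagnostic sketch I'm adding, not part of the original answer), compare the rating dicts directly: every full profile in tests is identical, and the 'seljak' literal repeats its keys, so it collapses to just two entries whose values also match everyone else's exactly:
# All four full profiles are literally equal, so r = 1.0 is correct.
print(tests['dzam'] == tests['kex'])               # True
print(tests['dzam'] == tests['test#example.com'])  # True

# Duplicate keys in a dict literal collapse: 'seljak' has only 2 items,
# and on those 2 items its ratings equal everyone else's.
print(len(tests['seljak']))  # 2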