Different outputs for byte objects with escape sequences (Python pandas msgpack)

As I understand it, Python represents escape sequences with \. So if I try to insert a single backslash into a string, the string is displayed with double backslashes, as below:
x = '/x91/x84/xa4/x74'
b = x.replace(r'/', '\\')
>>> b
'\\x91\\x84\\xa4\\x74'
But then if I have two bytes objects, one with single backslashes and another with double backslashes, and pass each to the pandas.read_msgpack() function, why do I get different outputs? Please see what I have tried below:
byte_obj1 = b'\x91\x84\xa4\x74\x69\x6d\x65\x92\xcb\x41\xdd\xcd\x65\x00\x00\x00\x00\xcb\x41\xdd\xcd\x65\x00\x00\xa3\xd7\xa4\x76\x61\x72\x30\x92\xcb\x40\x49\x0c\xcc\xcc\xcc\xcc\xcd\xcb\x40\x49\x0c\xcc\xcc\xcc\xcc\xcd\xa4\x76\x61\x72\x31\x92\xcb\xff\xf8\x00\x00\x00\x00\x00\x00\xcb\x40\x4e\x0c\xcc\xcc\xcc\xcc\xcd\xa4\x76\x61\x72\x32\x92\xcb\xff\xf8\x00\x00\x00\x00\x00\x00\xcb\xff\xf8\x00\x00\x00\x00\x00\x00'
d1=pandas.read_msgpack(byte_obj1)
>>> d1
({'time': (2000000000.0, 2000000000.01), 'var0': (50.1, 50.1), 'var1': (nan, 60.1), 'var2': (nan, nan)},)
byte_obj2 = b'\\x91\\x84\\xa4\\x74\\x69\\x6d\\x65\\x92\\xcb\\x41\\xdd\\xcd\\x65\\x00\\x00\\x00\\x00\\xcb\\x41\\xdd\\xcd\\x65\\x00\\x00\\xa3\\xd7\\xa4\\x76\\x61\\x72\\x30\\x92\\xcb\\x40\\x49\\x0c\\xcc\\xcc\\xcc\\xcc\\xcd\\xcb\\x40\\x49\\x0c\\xcc\\xcc\\xcc\\xcc\\xcd\\xa4\\x76\\x61\\x72\\x31\\x92\\xcb\\xff\\xf8\\x00\\x00\\x00\\x00\\x00\\x00\\xcb\\x40\\x4e\\x0c\\xcc\\xcc\\xcc\\xcc\\xcd\\xa4\\x76\\x61\\x72\\x32\\x92\\xcb\\xff\\xf8\\x00\\x00\\x00\\x00\\x00\\x00\\xcb\\xff\\xf8\\x00\\x00\\x00\\x00\\x00\\x00'
d2=pandas.read_msgpack(byte_obj2)
>>> d2
[92, 120, 57, 49, 92, 120, 56, 52, 92, 120, 97, 52, 92, 120, 55, 52, 92, 120, 54, 57, 92, 120, 54, 100, 92, 120, 54, 53, 92, 120, 57, 50, 92, 120, 99, 98, 92, 120, 52, 49, 92, 120, 100, 100, 92, 120, 99, 100, 92, 120, 54, 53, 92, 120, 48, 48, 92, 120, 48, 48, 92, 120, 48, 48, 92, 120, 48, 48, 92, 120, 99, 98, 92, 120, 52, 49, 92, 120, 100, 100, 92, 120, 99, 100, 92, 120, 54, 53, 92, 120, 48, 48, 92, 120, 48, 48, 92, 120, 97, 51, 92, 120, 100, 55, 92, 120, 97, 52, 92, 120, 55, 54, 92, 120, 54, 49, 92, 120, 55, 50, 92, 120, 51, 48, 92, 120, 57, 50, 92, 120, 99, 98, 92, 120, 52, 48, 92, 120, 52, 57, 92, 120, 48, 99, 92, 120, 99, 99, 92, 120, 99, 99, 92, 120, 99, 99, 92, 120, 99, 99, 92, 120, 99, 100, 92, 120, 99, 98, 92, 120, 52, 48, 92, 120, 52, 57, 92, 120, 48, 99, 92, 120, 99, 99, 92, 120, 99, 99, 92, 120, 99, 99, 92, 120, 99, 99, 92, 120, 99, 100, 92, 120, 97, 52, 92, 120, 55, 54, 92, 120, 54, 49, 92, 120, 55, 50, 92, 120, 51, 49, 92, 120, 57, 50, 92, 120, 99, 98, 92, 120, 102, 102, 92, 120, 102, 56, 92, 120, 48, 48, 92, 120, 48, 48, 92, 120, 48, 48, 92, 120, 48, 48, 92, 120, 48, 48, 92, 120, 48, 48, 92, 120, 99, 98, 92, 120, 52, 48, 92, 120, 52, 101, 92, 120, 48, 99, 92, 120, 99, 99, 92, 120, 99, 99, 92, 120, 99, 99, 92, 120, 99, 99, 92, 120, 99, 100, 92, 120, 97, 52, 92, 120, 55, 54, 92, 120, 54, 49, 92, 120, 55, 50, 92, 120, 51, 50, 92, 120, 57, 50, 92, 120, 99, 98, 92, 120, 102, 102, 92, 120, 102, 56, 92, 120, 48, 48, 92, 120, 48, 48, 92, 120, 48, 48, 92, 120, 48, 48, 92, 120, 48, 48, 92, 120, 48, 48, 92, 120, 99, 98, 92, 120, 102, 102, 92, 120, 102, 56, 92, 120, 48, 48, 92, 120, 48, 48, 92, 120, 48, 48, 92, 120, 48, 48, 92, 120, 48, 48, 92, 120, 48, 48]
Why does Python not treat a double backslash the same as a single backslash in an escape sequence? Could someone please help me understand? Thank you very much in advance.

In your initial snippet you wrote x = '/x91/x84/xa4/x74'. Those are forward slashes, not backslashes. In Python the backslash is the escape character, so in a double backslash the first backslash escapes the second: the literal b'\x91' is a single byte with value 0x91, whereas b'\\x91' is four bytes, the characters \, x, 9, 1. That is why read_msgpack does not see byte_obj2 as a msgpack payload but as a long run of literal characters; notice that its output is just the ASCII codes of those characters (92 is '\', 120 is 'x', and so on).
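A minimal sketch to make the difference concrete (the variable names are illustrative): the two literal styles produce byte strings of different lengths, and the escaped form can be turned back into real bytes via the unicode_escape codec:

```python
# Illustrative sketch: b'\x91' is one byte, b'\\x91' is four characters.
raw = b'\x91\x84\xa4\x74'          # 4 real bytes
escaped = b'\\x91\\x84\\xa4\\x74'  # 16 bytes: the characters \, x, 9, 1, ...

print(len(raw))      # 4
print(len(escaped))  # 16

# Interpret the literal \xNN sequences to recover the real bytes:
recovered = escaped.decode('unicode_escape').encode('latin-1')
print(recovered == raw)  # True
```

The latin-1 round trip works because unicode_escape maps each escape to a code point below 256, which latin-1 encodes back to the same byte value.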

Related

How to convert an array into a new array based on a lookup dictionary

I'm trying to convert a numpy array into a new array by using each value in the existing array and finding its corresponding key from a dictionary. The new array should consist of the corresponding dictionary keys.
Here is what I have:
# dictionary where values are lists
available_weights = {0.009174311926605505: [7, 14, 21, 25, 31, 32, 35, 45, 52, 82, 83, 96, 112, 119, 142], 0.009523809523809525: [33, 37, 43, 44, 69, 73, 75, 78, 79, 80, 102, 104, 110, 115, 150], 0.1111111111111111: [91], 0.019230769230769232: [36, 50, 127, 139], 0.010869565217391304: [10, 48, 55, 62, 77, 88, 103, 124, 131, 137, 147], 0.014084507042253521: [2, 3, 4, 22, 27, 30, 41, 53, 87, 122, 123, 132, 143], 0.011494252873563218: [20, 34, 99, 125, 135, 138, 141], 0.045454545454545456: [0, 109], 0.01818181818181818: [49, 64, 72, 90, 146, 148], 0.07142857142857142: [106], 0.01282051282051282: [16, 63, 68, 98, 114, 130, 145], 0.010638297872340425: [8, 28, 40, 57, 61, 66, 71, 74, 76, 84, 85, 86, 128, 144], 0.02040816326530612: [6, 65], 0.021739130434782608: [29, 67, 92, 93], 0.02127659574468085: [47, 118, 120], 0.011111111111111112: [1, 13, 19, 24, 42, 54, 70, 89, 94, 107, 117, 126, 129, 140], 0.015625: [38, 60, 101, 133, 134, 136], 0.03333333333333333: [56, 58, 97, 121], 0.016666666666666666: [5, 26, 105, 113], 0.014705882352941176: [17, 46, 95]}
# existing numpy array
train_idx = [134, 45, 137, 140, 79, 98, 128, 80, 99, 71, 145, 35, 94, 122, 77, 23, 113, 44, 68, 21, 20, 125, 74, 139, 29, 109, 25, 34, 6, 81, 22, 114, 12, 95, 150, 106, 84, 19, 58, 59, 88, 143, 136, 43, 72, 132, 117, 13, 65, 111, 39, 14, 56, 11, 26, 90, 119, 112, 27, 57, 46, 147, 123, 16, 36, 100, 141, 38, 62, 32, 75, 146, 89, 37, 31, 40, 64, 87, 3, 103, 102, 104, 78, 53, 1, 142, 47, 130, 105, 4, 93, 52, 42, 10, 9, 115, 76, 54, 49, 116, 69, 5, 86, 66, 101, 107, 96, 110, 8, 73, 121, 138, 67, 124, 108, 97, 120, 2, 148, 127, 135, 18, 149, 82, 41, 144, 129, 118, 51, 126, 33, 85, 24, 0, 61, 92, 70, 15, 17, 50, 83, 30, 28, 91, 60, 48, 133, 55, 63, 7, 131]
So I want to use each value in train_idx to find the corresponding dictionary key in available_weights. The expected output should look like this (with one weight for every value in train_idx):
new_array = [0.015625, 0.009174311926605505, 0.010869565217391304, ... ,0.01282051282051282, 0.009174311926605505, 0.010869565217391304]
Any help would be appreciated!
result = []
flipped = dict()
for value in train_idx:
    flipped[value] = []
    for key in available_weights:
        if value in available_weights[key]:
            flipped[value].append(key)
            result.append(key)
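For reference, the double loop above can be replaced by inverting the dictionary once and then doing a plain lookup per element. A minimal sketch, using toy data in place of the full available_weights and train_idx above:

```python
# Toy stand-ins for the real data
available_weights = {0.5: [7, 14], 0.25: [3, 9, 11]}
train_idx = [3, 7, 11, 14]

# Invert once: each list member maps back to its key
index_to_weight = {idx: w
                   for w, members in available_weights.items()
                   for idx in members}

new_array = [index_to_weight[i] for i in train_idx]
print(new_array)  # [0.25, 0.5, 0.25, 0.5]
```

If some indices in train_idx may be missing from every list (a few in the question's data appear to be), `index_to_weight.get(i)` avoids a KeyError.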

How to generate combinations from multiple variables?

I have the following variables:
D8 = [22, 27, 28, 30, 31, 40, 41, 42, 43, 45]
D9 = [79, 80, 90, 92, 93, 97, 98, 104, 105, 109]
D10 = [61, 64, 66, 70, 72, 76, 81, 86, 87]
Using the above variables, I tried to generate all possible combinations as follows:
import itertools
stuff = [D8, D9, D10]
for L in range(0, len(stuff) + 1):
    for subset in itertools.combinations(stuff, L):
        print(subset)
The result is as follows:
Output>>>
()
([22, 27, 28, 30, 31, 40, 41, 42, 43, 45],)
([79, 80, 90, 92, 93, 97, 98, 104, 105, 109],)
([61, 64, 66, 70, 72, 76, 81, 86, 87],)
([22, 27, 28, 30, 31, 40, 41, 42, 43, 45], [79, 80, 90, 92, 93, 97, 98, 104, 105, 109])
([22, 27, 28, 30, 31, 40, 41, 42, 43, 45], [61, 64, 66, 70, 72, 76, 81, 86, 87])
([79, 80, 90, 92, 93, 97, 98, 104, 105, 109], [61, 64, 66, 70, 72, 76, 81, 86, 87])
([22, 27, 28, 30, 31, 40, 41, 42, 43, 45], [79, 80, 90, 92, 93, 97, 98, 104, 105, 109], [61, 64, 66, 70, 72, 76, 81, 86, 87])
Is there a cleaner method to generate this result? In the current output each combination is a tuple of lists, and I cannot flatten it into a single list per outcome. The expected result should look like this:
Expected Result>>>
([])
([22, 27, 28, 30, 31, 40, 41, 42, 43, 45])
([79, 80, 90, 92, 93, 97, 98, 104, 105, 109])
([61, 64, 66, 70, 72, 76, 81, 86, 87])
([22, 27, 28, 30, 31, 40, 41, 42, 43, 45, 79, 80, 90, 92, 93, 97, 98, 104, 105, 109])
([22, 27, 28, 30, 31, 40, 41, 42, 43, 45, 61, 64, 66, 70, 72, 76, 81, 86, 87])
([79, 80, 90, 92, 93, 97, 98, 104, 105, 109, 61, 64, 66, 70, 72, 76, 81, 86, 87])
([22, 27, 28, 30, 31, 40, 41, 42, 43, 45, 79, 80, 90, 92, 93, 97, 98, 104, 105, 109, 61, 64, 66, 70, 72, 76, 81, 86, 87])
Thank you in advance!
The itertools documentation contains the flatten recipe:
def flatten(list_of_lists):
    "Flatten one level of nesting"
    return chain.from_iterable(list_of_lists)
which you need to apply twice to get the desired result:
from itertools import combinations, chain
D8 = [22, 27, 28, 30, 31, 40, 41, 42, 43, 45]
D9 = [79, 80, 90, 92, 93, 97, 98, 104, 105, 109]
D10 = [61, 64, 66, 70, 72, 76, 81, 86, 87]
stuff = [D8, D9, D10]
def flatten(list_of_lists):
    """Flatten one level of nesting"""
    return chain.from_iterable(list_of_lists)

result = flatten(map(flatten, combinations(stuff, length)) for length in range(len(stuff) + 1))

for xs in result:
    print(list(xs))
producing:
[]
[22, 27, 28, 30, 31, 40, 41, 42, 43, 45]
[79, 80, 90, 92, 93, 97, 98, 104, 105, 109]
[61, 64, 66, 70, 72, 76, 81, 86, 87]
[22, 27, 28, 30, 31, 40, 41, 42, 43, 45, 79, 80, 90, 92, 93, 97, 98, 104, 105, 109]
[22, 27, 28, 30, 31, 40, 41, 42, 43, 45, 61, 64, 66, 70, 72, 76, 81, 86, 87]
[79, 80, 90, 92, 93, 97, 98, 104, 105, 109, 61, 64, 66, 70, 72, 76, 81, 86, 87]
[22, 27, 28, 30, 31, 40, 41, 42, 43, 45, 79, 80, 90, 92, 93, 97, 98, 104, 105, 109, 61, 64, 66, 70, 72, 76, 81, 86, 87]
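Equivalently, the whole thing can be written as one list comprehension with chain; a sketch with shortened toy lists standing in for D8, D9, D10:

```python
from itertools import chain, combinations

stuff = [[1, 2], [3], [4, 5]]  # toy stand-ins for D8, D9, D10

# One flattened list per combination, for every combination size r
result = [list(chain(*combo))
          for r in range(len(stuff) + 1)
          for combo in combinations(stuff, r)]
print(result)
# [[], [1, 2], [3], [4, 5], [1, 2, 3], [1, 2, 4, 5], [3, 4, 5], [1, 2, 3, 4, 5]]
```

This materialises every combination as a list up front, which is convenient when you want to index into the results rather than stream them.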

Clark-Evans aggregation index in Python

The Clark-Evans index is one of the most basic statistics to measure point aggregation in spatial analysis. However, I can't find any implementation in Python. So I adapted the R code from the hyperlink above. I want to ask if the statistic and p-value are correct with such irregular study areas:
Function
import numpy as np
import os, math
from shapely.geometry import Polygon, Point
from sklearn.neighbors import KDTree
from statistics import NormalDist
def clarkEvans(X, Y, roi):
    """Clark-Evans index: takes point x, y coordinates and a polygon for the cell shape (roi) and outputs the Clark-Evans index:
    R ~ 1 suggests spatial randomness, while R << 1 suggests clustering and R >> 1 suggests ordering."""
    # Import cell boundaries from the roi
    pgon = Polygon(roi)
    # Calculate intensity from points/area
    areaW = pgon.area
    npts = len(X)
    intensity = npts/areaW
    if npts < 2:
        return np.nan
    tmp_df = list(zip(X, Y))
    # Get nearest neighbours for each observation
    kdt = KDTree(tmp_df, leaf_size=30, metric='euclidean')  # This is very good for large datasets, but maybe bad for small ones?
    dists, ids = kdt.query(tmp_df, k=2)
    dists = [x[1] for x in dists]
    # Clark-Evans index (mean NN distance / mean NN distance under Poisson)
    Dobs = np.mean(dists)
    Dpoison = 1/(2 * math.sqrt(intensity))
    Rnaive = Dobs/Dpoison
    # Calculate p-value under the normal distribution
    SE = math.sqrt(((4 - math.pi) * areaW)/(4 * math.pi))/npts
    Z = (Dobs - Dpoison)/SE
    # The difference between observed and expected NN distances should be normally distributed according to the Central Limit Theorem (CLT)
    p_val = NormalDist().pdf(Z)  # p_val for clustering
    # Return the Clark-Evans index and p-value
    return round(Rnaive, 3), round(p_val, 3)
Output
The image shows my Clark-Evans index applied to and plotted for two different datasets. The index is similar for both patterns, although one of them seems more obviously clustered. The p-values seem switched; I would expect the second plot, being the clustered one, to have the significant p-value.
Input data
# The xy coordinates of observations plus the point vertices of the study area (roi)
x1 = [123, 105, 71, 109, 96, 49, 86, 80, 120, 98, 59, 100, 118, 69, 84, 21, 95, 77, 158, 118, 87, 77, 87, 77, 82, 106, 120, 125, 61, 24, 53, 106, 52, 103, 89, 99, 111, 58, 97, 83, 51, 45, 64, 112, 114, 73, 55, 111, 110, 102, 116, 107, 84, 97, 118, 96, 116, 45, 102, 145, 126, 50, 103, 98, 20, 79, 113, 99, 90, 143, 36, 120, 106, 91, 95, 15, 122, 69, 28, 71, 66, 119, 78, 75, 113, 44, 85, 60, 88, 68, 116, 40, 59, 105, 65, 94, 79, 95, 120, 67, 78, 59, 89, 84, 111, 78, 72, 156, 162, 134, 157, 120, 126, 86, 58, 137, 32, 91, 68, 119, 112, 70, 120, 62, 118, 114, 66, 55, 99, 72, 91, 109, 53, 94, 71, 145, 146, 106, 15, 83, 104, 61, 129, 51, 58, 59, 113, 107, 94, 94, 69, 118, 74, 124, 107, 99, 66, 115, 159, 71, 115, 122, 76, 68, 79, 107, 81, 104, 87, 106, 105, 112, 111, 79, 54, 108, 62, 115, 36, 74, 84, 75, 64, 92, 64, 82, 77, 56, 75, 69, 88, 105, 96, 61, 84, 106, 31, 53, 173, 102, 99, 124, 87, 70, 25, 19, 122, 101, 126, 60, 94, 78, 97, 64, 45, 92, 114, 87, 96, 160, 88, 66, 40, 124, 103, 60, 129, 120, 35, 95, 56, 76, 116, 65, 7, 103, 160, 63, 134, 101, 56, 50, 89, 92, 99, 89, 120, 47, 58, 47, 74, 124, 8, 93, 121, 53, 66, 63, 90, 114, 91, 71, 123, 55, 142, 97, 69, 141, 92, 76, 69, 74, 66, 90, 81, 96, 110, 61, 58, 62, 50, 125, 106, 115, 79, 94, 118, 117, 64, 99, 55, 53, 93, 57, 116, 61, 125, 10, 119, 74, 64, 77, 127, 115, 59, 53, 99, 81, 68, 101, 43, 122, 129, 109, 108, 84, 103, 59, 105, 76, 122, 101, 101, 108, 79, 75, 60, 111, 97, 104, 82, 67, 96, 70, 96, 104, 103, 66, 89, 114, 121, 119, 104, 93, 156, 108, 88, 98, 52, 112, 65, 99, 107, 90, 107, 115, 73, 106, 100, 120, 128, 66, 116, 69, 113, 69, 103, 62, 124, 110, 124, 72, 76, 115, 73, 84, 95, 100, 51, 61, 82, 97, 106, 68, 112, 69, 115, 67, 80, 72, 63, 123, 92, 101, 61, 69, 103, 112, 70, 59, 91, 90, 102, 111, 41, 101, 90, 33, 122, 161, 161]
y1 = [37, 51, 35, 67, 94, 114, 62, 24, 64, 92, 55, 11, 74, 38, 79, 77, 90, 77, 70, 70, 41, 46, 81, 83, 81, 65, 63, 43, 56, 95, 26, 8, 68, 82, 44, 78, 77, 72, 45, 68, 83, 99, 100, 58, 91, 89, 115, 34, 46, 68, 79, 71, 41, 43, 48, 83, 67, 69, 42, 55, 63, 69, 47, 67, 102, 72, 33, 77, 67, 1, 123, 59, 69, 47, 73, 79, 89, 48, 55, 97, 56, 92, 121, 70, 48, 47, 114, 62, 84, 78, 54, 55, 79, 76, 62, 63, 83, 71, 74, 83, 50, 67, 84, 81, 75, 59, 12, 77, 97, 6, 26, 55, 10, 74, 58, 59, 77, 76, 77, 68, 60, 50, 53, 89, 76, 87, 67, 86, 86, 73, 79, 74, 62, 54, 67, 58, 23, 76, 95, 63, 38, 76, 117, 18, 52, 46, 98, 62, 44, 36, 86, 52, 74, 51, 85, 100, 75, 73, 63, 38, 64, 91, 47, 70, 77, 88, 70, 88, 88, 39, 52, 45, 79, 56, 74, 60, 59, 69, 116, 44, 55, 48, 70, 83, 66, 87, 78, 73, 58, 76, 46, 50, 43, 81, 102, 45, 115, 88, 80, 34, 55, 55, 97, 103, 112, 122, 111, 97, 90, 81, 22, 36, 87, 86, 48, 39, 42, 83, 57, 16, 100, 89, 115, 75, 69, 86, 69, 69, 74, 39, 52, 23, 63, 49, 92, 96, 71, 105, 10, 75, 84, 80, 30, 30, 59, 52, 32, 119, 107, 74, 79, 101, 106, 99, 77, 66, 89, 83, 102, 94, 97, 78, 91, 93, 16, 11, 33, 16, 78, 50, 30, 26, 79, 34, 32, 86, 64, 40, 63, 51, 58, 52, 92, 98, 35, 36, 34, 47, 86, 88, 60, 80, 92, 96, 94, 94, 98, 111, 49, 54, 56, 36, 72, 94, 92, 102, 105, 32, 40, 30, 73, 59, 107, 39, 46, 40, 53, 57, 93, 92, 63, 59, 65, 68, 81, 69, 56, 53, 53, 85, 56, 55, 93, 45, 40, 68, 101, 93, 29, 44, 93, 93, 46, 67, 38, 34, 97, 93, 72, 90, 62, 68, 32, 31, 74, 71, 59, 38, 51, 95, 73, 82, 5, 53, 50, 34, 49, 43, 82, 77, 65, 88, 87, 89, 30, 38, 45, 36, 79, 89, 88, 100, 98, 45, 41, 20, 35, 51, 77, 64, 60, 63, 33, 44, 78, 82, 83, 70, 74, 78, 41, 61, 71, 40, 124, 82, 67, 121, 5, 65, 66]
roi1 = [[152.5078125, 3.7060546875], [158.8408203125, 12.455078125], [165.5126953125, 25.3154296875], [170.796875, 38.787109375], [171.013671875, 46.02734375], [172.6083984375, 53.0615234375], [172.6083984375, 63.9306640625], [174.419921875, 70.9169921875], [174.419921875, 85.41015625], [175.947265625, 92.4296875], [175.7998046875, 103.2626953125], [169.52734375, 116.3212890625], [166.9765625, 118.89453125], [159.7451171875, 119.2177734375], [152.7265625, 121.01953125], [138.2333984375, 121.029296875], [131.21875, 122.8408203125], [73.248046875, 122.8408203125], [66.2119140625, 124.5546875], [58.966796875, 124.65234375], [51.9990234375, 126.4638671875], [23.013671875, 126.4638671875], [19.42578125, 125.958984375], [16.5361328125, 123.7734375], [10.20703125, 115.0283203125], [0.5068359375, 95.57421875], [0.5537109375, 91.951171875], [9.0869140625, 80.318359375], [12.552734375, 73.9599609375], [18.884765625, 65.2119140625], [25.89453125, 56.994140625], [35.611328125, 41.7626953125], [42.345703125, 33.296875], [45.7568359375, 26.90625], [53.634765625, 14.7744140625], [58.1103515625, 9.078125], [64.916015625, 6.8984375], [86.654296875, 6.8984375], [93.6904296875, 5.291015625], [100.89453125, 4.57421875], [104.2763671875, 3.275390625], [122.3837890625, 3.025390625], [129.3935546875, 1.4638671875], [143.8857421875, 1.4638671875], [147.376953125, 0.4931640625]]
clarkEvans(x1, y1, roi1)
x2 = [94, 111, 79, 95, 86, 46, 30, 34, 53, 17, 44, 20, 42, 56, 23, 21, 50, 16, 50, 52, 47, 132, 44, 40, 43, 33, 29, 52, 24, 125, 86, 84]
y2 = [17, 71, 94, 88, 108, 132, 116, 115, 121, 132, 120, 121, 123, 116, 116, 139, 121, 124, 116, 140, 141, 33, 119, 118, 125, 130, 123, 122, 40, 23, 80, 107]
roi2 = [[129.4560546875, 3.6552734375], [132.3408203125, 5.84765625], [134.4638671875, 12.7744140625], [134.4638671875, 45.3828125], [132.65234375, 56.0302734375], [132.65234375, 66.8994140625], [131.4169921875, 70.3056640625], [130.7021484375, 77.5029296875], [129.029296875, 84.5419921875], [129.029296875, 88.1650390625], [127.2177734375, 95.1728515625], [127.16796875, 106.04296875], [125.40625, 113.0712890625], [125.40625, 116.6943359375], [123.896484375, 119.9873046875], [120.6533203125, 121.6025390625], [110.0654296875, 123.9130859375], [99.6181640625, 126.8427734375], [89.896484375, 131.7041015625], [83.8388671875, 135.638671875], [77.03515625, 138.134765625], [73.4228515625, 138.40625], [56.181640625, 143.8408203125], [45.568359375, 142.029296875], [31.076171875, 142.029296875], [27.4736328125, 141.6455078125], [20.626953125, 139.302734375], [15.5029296875, 134.1787109375], [11.0546875, 128.5234375], [4.537109375, 115.5849609375], [0.40625, 98.0078125], [0.513671875, 72.646484375], [4.927734375, 55.0859375], [11.4091796875, 42.123046875], [15.333984375, 36.0859375], [21.94921875, 27.4912109375], [34.7587890625, 14.6806640625], [43.416015625, 8.205078125], [57.9013671875, 7.970703125], [61.28125, 6.666015625], [71.9052734375, 4.541015625], [82.7705078125, 4.34765625], [89.7578125, 2.5361328125], [107.873046875, 2.5361328125], [114.8916015625, 1.015625], [122.126953125, 0.724609375]]
clarkEvans(x2, y2, roi2)
Using the original R function yields similar but not equal results:
clarkevans.test(X, alternative = "clustered")
>R= 0.87719, p-value = 9.542e-07 # First dataset
>R= 0.83365, p-value = 0.03591 # Second dataset
I'm not sure if the statistic and p-value calculation are valid since my study areas are irregularly shaped. The variable SE is calculated with pi, which seems like it is estimating a random distribution in a circular study area. Should I do Monte Carlo simulations instead? Is there a way of avoiding that?
Cheers!
I have not worked with the Clark-Evans (CE) index before, but having read the information you linked to and studied your code, my interpretation is this:
The index value for Dataset2 is less than the index value for Dataset1. This correctly reflects the visual difference in clusteredness, that is, the smaller index value is associated with data that is more clustered.
It is probably not meaningful to say that two CE index values are similar, other than in special cases, like observing that two CE index values are both smaller than 1 or both greater than 1, or that if A < B < C, then A and B are more similar than A and C.
The p-value and the index value measure different things. The index value measures degree of clusteredness (if less than 1) or regularity (if greater than 1). The p-value (inversely) measures how certain it is that the data are more clustered than would be expected by chance, or more regular than would be expected by chance. The p-value in particular is sensitive to the sample size as well as the distribution of points.
The use of pi in calculating SE reflects the assumption of Euclidean distances between points (rather than, say, city block distances). That is, the nearest neighbour of a point is the one at the smallest radial distance. The use of pi in calculating SE does not make any assumptions about the shape of the region of interest.
Particularly for small datasets (like Dataset2) you will want to track down information about the potential impact of boundary effects on the index value or the p-value.
More speculatively, I wonder if it would be useful to use a convex hull to help determine the region of interest rather than do this subjectively.
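On the Monte Carlo question raised above: a speculative sketch of a simulation-based p-value under complete spatial randomness (CSR), using rejection sampling inside the polygon so that no circular-region assumption is needed. The function names and defaults here are illustrative, not part of the original code, and the brute-force distance matrix is only reasonable for datasets this small:

```python
import numpy as np
from shapely.geometry import Polygon, Point

def mean_nn_distance(pts):
    """Mean nearest-neighbour distance by brute force (fine for small n)."""
    pts = np.asarray(pts, dtype=float)
    d = np.sqrt(((pts[:, None, :] - pts[None, :, :]) ** 2).sum(-1))
    np.fill_diagonal(d, np.inf)  # ignore self-distances
    return d.min(axis=1).mean()

def csr_p_value(X, Y, roi, n_sim=999, seed=0):
    """Fraction of CSR simulations at least as clustered as the observed data."""
    rng = np.random.default_rng(seed)
    pgon = Polygon(roi)
    minx, miny, maxx, maxy = pgon.bounds
    n = len(X)
    observed = mean_nn_distance(list(zip(X, Y)))
    hits = 0
    for _ in range(n_sim):
        sim = []
        while len(sim) < n:  # rejection sampling inside the polygon
            p = (rng.uniform(minx, maxx), rng.uniform(miny, maxy))
            if pgon.contains(Point(p)):
                sim.append(p)
        if mean_nn_distance(sim) <= observed:
            hits += 1
    return (hits + 1) / (n_sim + 1)  # standard Monte Carlo correction
```

With the data above this could be called as csr_p_value(x2, y2, roi2); a small hit fraction would indicate clustering beyond what CSR produces inside that exact polygon.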

SVC python output showing the same value of "1" for every C or gamma used

This is the code:
import numpy as np
from sklearn import svm

numere = np.fromfile("sat.trn", dtype=int, count=-1, sep=" ")
numereTest = np.fromfile("sat.tst", dtype=int, count=-1, sep=" ")
numere = numere.reshape(int(len(numere)/37), 37)
numereTest = numereTest.reshape(int(len(numereTest)/37), 37)
etichete = numere[0:int(len(numere)), 36]
eticheteTest = numereTest[0:int(len(numereTest)), 36]
numere = np.delete(numere, 36, 1)
numereTest = np.delete(numereTest, 36, 1)
clf = svm.SVC(kernel='rbf', C=1, gamma=1)
clf.fit(numere, etichete)
predictie = clf.predict(numereTest)
I took the data from a file that has it all and then I made two np.arrays from it, but the output is 1 whatever I do.
numere[:10] --> array([[ 92, 115, 120, 94, 84, 102, 106, 79, 84, 102, 102, 83, 101,
126, 133, 103, 92, 112, 118, 85, 84, 103, 104, 81, 102, 126,
134, 104, 88, 121, 128, 100, 84, 107, 113, 87],
[ 84, 102, 106, 79, 84, 102, 102, 83, 80, 102, 102, 79, 92,
112, 118, 85, 84, 103, 104, 81, 84, 99, 104, 78, 88, 121,
128, 100, 84, 107, 113, 87, 84, 99, 104, 79],
[ 84, 102, 102, 83, 80, 102, 102, 79, 84, 94, 102, 79, 84,
103, 104, 81, 84, 99, 104, 78, 84, 99, 104, 81, 84, 107,
113, 87, 84, 99, 104, 79, 84, 99, 104, 79],
[ 80, 102, 102, 79, 84, 94, 102, 79, 80, 94, 98, 76, 84,
99, 104, 78, 84, 99, 104, 81, 76, 99, 104, 81, 84, 99,
104, 79, 84, 99, 104, 79, 84, 103, 104, 79],
[ 84, 94, 102, 79, 80, 94, 98, 76, 80, 102, 102, 79, 84,
99, 104, 81, 76, 99, 104, 81, 76, 99, 108, 85, 84, 99,
104, 79, 84, 103, 104, 79, 79, 107, 109, 87],
[ 80, 94, 98, 76, 80, 102, 102, 79, 76, 102, 102, 79, 76,
99, 104, 81, 76, 99, 108, 85, 76, 103, 118, 88, 84, 103,
104, 79, 79, 107, 109, 87, 79, 107, 109, 87],
[ 76, 102, 106, 83, 76, 102, 106, 87, 80, 98, 106, 79, 80,
107, 118, 88, 80, 112, 118, 88, 80, 107, 113, 85, 79, 107,
113, 87, 79, 103, 104, 83, 79, 103, 104, 79],
[ 76, 102, 106, 87, 80, 98, 106, 79, 76, 94, 102, 76, 80,
112, 118, 88, 80, 107, 113, 85, 80, 95, 100, 78, 79, 103,
104, 83, 79, 103, 104, 79, 79, 95, 100, 79],
[ 76, 89, 98, 76, 76, 94, 98, 76, 76, 98, 102, 72, 80,
95, 104, 74, 76, 91, 104, 74, 76, 95, 100, 78, 75, 91,
96, 75, 75, 91, 96, 71, 79, 87, 93, 71],
[ 76, 94, 98, 76, 76, 98, 102, 72, 76, 94, 90, 76, 76,
91, 104, 74, 76, 95, 100, 78, 76, 91, 100, 74, 75, 91,
96, 71, 79, 87, 93, 71, 79, 87, 93, 67]])
OK, so the most likely reasons for what you get are these:
First, you are not scaling the data; try the standard scaler:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(numere)
numere = scaler.transform(numere)
numereTest = scaler.transform(numereTest)
Second, you are not tuning your parameters. You need to select the best-fitting parameters, and I strongly recommend grid search; you can find an example of how to use it here. Grid search is good for parameter tuning, but take care not to use cross-validation on this dataset, as that is the recommendation from its creators :) Gamma and C can take a wide range of values, from very small decimals to very large numbers, so you can't test them properly by hand.
Edit: since you should not use CV, this is a better way for you to do the grid search:
from sklearn.model_selection import ParameterGrid

grid = {  # edit this with more values
    'gamma': [0.001, 0.1, 10, 100, 1000],
    'C': [1, 10, 100],
}
best_score = 0
best_grid = None
for g in ParameterGrid(grid):
    clf.set_params(**g)
    clf.fit(numere, etichete)
    # keep the parameter set with the best test score
    score = clf.score(numereTest, eticheteTest)
    if score > best_score:
        best_score = score
        best_grid = g
print("best score:", best_score)
print("Grid:", best_grid)

Averaging Vertically in Nested Lists

I am writing a grade book program that sets up a nested list with assignments as columns and individual students as rows. The program must calculate the average for each assignment and the average for each student. I've got the average by student, but I can't figure out how to calculate the average by assignment. Any help would be appreciated!
# gradebook.py
# Display the average of each student's grades.
# Display the average for each assignment.
gradebook = [[61, 74, 69, 62, 72, 66, 73, 65, 60, 63, 69, 63, 62, 61, 64],
             [73, 80, 78, 76, 76, 79, 75, 73, 76, 74, 77, 79, 76, 78, 72],
             [90, 92, 93, 92, 88, 93, 90, 95, 100, 99, 100, 91, 95, 99, 96],
             [96, 89, 94, 88, 100, 96, 93, 92, 94, 98, 90, 90, 92, 91, 94],
             [76, 76, 82, 78, 82, 76, 84, 82, 80, 82, 76, 86, 82, 84, 78],
             [93, 92, 89, 84, 91, 86, 84, 90, 95, 86, 88, 95, 88, 84, 89],
             [63, 66, 55, 67, 66, 68, 66, 56, 55, 62, 59, 67, 60, 70, 67],
             [86, 92, 93, 88, 90, 90, 91, 94, 90, 86, 93, 89, 94, 94, 92],
             [89, 80, 81, 89, 86, 86, 85, 80, 79, 90, 83, 85, 90, 79, 80],
             [99, 73, 86, 77, 87, 99, 71, 96, 81, 83, 71, 75, 91, 74, 72]]
# make a variable for assignment averages
# make a variable for student averages
stu_avg = [sum(row)/len(row) for row in gradebook]
print(stu_avg)
# Assignment class
class Assignment:
    def __init__(self, name, average):
        self.average = average
        self.name = name

    def print_grade(self):
        print("Assignment", self.name, ":", self.average)

# Student class
class Student:
    def __init__(self, name, average):
        self.average = average
        self.name = name

    def print_grade(self):
        print("Student", self.name, ":", self.average)
s1 = Student("1", stu_avg[0])
s2 = Student("2", stu_avg[1])
s3 = Student("3", stu_avg[2])
s4 = Student("4", stu_avg[3])
s5 = Student("5", stu_avg[4])
s6 = Student("6", stu_avg[5])
s7 = Student("7", stu_avg[6])
s8 = Student("8", stu_avg[7])
s9 = Student("9", stu_avg[8])
s10 = Student("10", stu_avg[9])
s1.print_grade()
s2.print_grade()
s3.print_grade()
s4.print_grade()
s5.print_grade()
s6.print_grade()
s7.print_grade()
s8.print_grade()
s9.print_grade()
s10.print_grade()
Instead of using loops, let's use matrices. They make the calculation much faster, especially when dealing with large datasets.
As an example,
Per student:
[1, 2, 3, 4]       [1]
[4, 5, 6, 6]   x   [1]
[1, 1, 3, 1]       [1]
                   [1]
Per assignment:
[1, 2, 3, 4]T      [1]
[4, 5, 6, 6]   x   [1]
[1, 1, 3, 1]       [1]
The first operation returns the per-student sums, and the second returns the per-assignment sums. Divide by the appropriate count to get the averages.
Using numpy
import numpy as np
gradebook = [[61, 74, 69, 62, 72, 66, 73, 65, 60, 63, 69, 63, 62, 61, 64],
[73, 80, 78, 76, 76, 79, 75, 73, 76, 74, 77, 79, 76, 78, 72],
[90, 92, 93, 92, 88, 93, 90, 95, 100, 99, 100, 91, 95, 99, 96],
[96, 89, 94, 88, 100, 96, 93, 92, 94, 98, 90, 90, 92, 91, 94],
[76, 76, 82, 78, 82, 76, 84, 82, 80, 82, 76, 86, 82, 84, 78],
[93, 92, 89, 84, 91, 86, 84, 90, 95, 86, 88, 95, 88, 84, 89],
[63, 66, 55, 67, 66, 68, 66, 56, 55, 62, 59, 67, 60, 70, 67],
[86, 92, 93, 88, 90, 90, 91, 94, 90, 86, 93, 89, 94, 94, 92],
[89, 80, 81, 89, 86, 86, 85, 80, 79, 90, 83, 85, 90, 79, 80],
[99, 73, 86, 77, 87, 99, 71, 96, 81, 83, 71, 75, 91, 74, 72]]
def get_student_average(gradebook):
    number_of_students = len(gradebook[0])
    number_of_assignments = len(gradebook)
    matrix = [1] * number_of_students
    # [1, 1, 1, 1, ...]. This is 1 x 15; transpose to make it 15 x 1
    # Convert both to numpy arrays
    matrix = np.array(matrix)
    gradebook = np.array(gradebook)
    # Transpose the ones vector and multiply
    print(gradebook.dot(matrix.T))

def get_assignment_average(gradebook):
    number_of_students = len(gradebook[0])
    number_of_assignments = len(gradebook)
    matrix = [1] * number_of_assignments
    # [1, 1, 1, ...]. This is 1 x 10; transpose to make it 10 x 1
    matrix = np.array(matrix)
    gradebook = np.array(gradebook)
    gradebook = gradebook.T
    matrix = matrix.T
    print(gradebook.dot(matrix))

get_student_average(gradebook)
get_assignment_average(gradebook)
Results
student_avg -> [ 984 1142 1413 1397 1204 1334 947 1362 1262 1235]
test_avg -> [826 814 820 801 838 839 812 823 810 823 806 820 830 814 804]
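Note that the printed values are per-student and per-assignment sums, so they still need to be divided by the row or column length. If plain averages are all that is needed, numpy's axis means compute them directly; a toy 2x3 sketch:

```python
import numpy as np

# Two students, three assignments (toy data)
gradebook = np.array([[61, 74, 69],
                      [73, 80, 78]])

student_avg = gradebook.mean(axis=1)     # average across each row (student)
assignment_avg = gradebook.mean(axis=0)  # average down each column (assignment)
print(student_avg)     # [68. 77.]
print(assignment_avg)  # [67.  77.  73.5]
```

This is the same dot-product-with-ones idea, with the summing and dividing folded into one call.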
