How does XGBoost/lightGBM evaluate ndcg for ranking tasks? - python

I am currently running tests between XGBoost/lightGBM for their ability to rank items. I am reproducing the benchmarks presented here: https://github.com/guolinke/boosting_tree_benchmarks.
I have been able to successfully reproduce the benchmarks mentioned in their work. I want to make sure that I am correctly implementing my own version of the ndcg metric and also understanding the ranking problem correctly.
My questions are:
When creating the validation for the test set using ndcg - there is a test.group file that says the first X rows are group 0, etc. To get the recommendations for a group, do I take the predicted values and known relevance scores and sort that list by descending predicted value within each group?
In order to get the final ndcg score from the lists created above - do I compute the ndcg score for each list and take the mean over all the scores? Is this the same evaluation methodology that XGBoost/lightGBM use in the evaluation phase?
Here is my methodology for evaluating the test set after the model has finished training.
For the final tree when I run lightGBM I obtain these values on the validation set:
[500] valid_0's ndcg#1: 0.513221 valid_0's ndcg#3: 0.499337 valid_0's ndcg#5: 0.505188 valid_0's ndcg#10: 0.523407
My final step is to take the predicted output for the test set and calculate the ndcg values for the predictions.
Here is my python code for calculating ndcg:
import numpy as np

def dcg_at_k(r, k):
    r = np.asfarray(r)[:k]
    if r.size:
        return np.sum(np.subtract(np.power(2, r), 1) / np.log2(np.arange(2, r.size + 2)))
    return 0.

def ndcg_at_k(r, k):
    idcg = dcg_at_k(sorted(r, reverse=True), k)
    if not idcg:
        return 0.
    return dcg_at_k(r, k) / idcg
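For example (made-up relevance lists, just to show what this version returns):

print(ndcg_at_k([2.0, 1.0, 0.0], 10))  # 1.0 - list is already in ideal order
print(ndcg_at_k([0.0, 0.0, 0.0], 10))  # 0.0 - the ideal DCG is 0, so this version returns 0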
After I get the predictions for the test set for a particular group (GROUP-0) I have these predictions:
query_id predict
0 0 (2.0, -0.221681199441)
1 0 (1.0, 0.109895548348)
2 0 (1.0, 0.0262799346312)
3 0 (0.0, -0.595343431322)
4 0 (0.0, -0.52689043426)
5 0 (0.0, -0.542221350664)
6 0 (1.0, -0.448015576024)
7 0 (1.0, -0.357090949646)
8 0 (0.0, -0.279677741045)
9 0 (0.0, 0.2182200869)
NOTE: Group-0 actually has about 112 rows.
I then sort the list of tuples in descending order of predicted value, which provides a list of relevance scores:
def get_recommendations(x):
    sorted_list = sorted(list(x), key=lambda i: i[1], reverse=True)
    return [k for k, _ in sorted_list]

relevance = evaluation.groupby('query_id').predict.apply(get_recommendations)
query_id
0 [4.0, 2.0, 2.0, 3.0, 2.0, 2.0, 2.0, 2.0, 2.0, ...
1 [4.0, 2.0, 2.0, 2.0, 1.0, 1.0, 3.0, 2.0, 1.0, ...
2 [2.0, 3.0, 2.0, 2.0, 1.0, 0.0, 2.0, 2.0, 1.0, ...
3 [2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, ...
4 [1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, ...
Finally, for each query id I calculate the ndcg score on the relevance list and then take the mean of all the ndcg scores calculated for each query id:
relevance.apply(lambda x: ndcg_at_k(x, 10)).mean()
The value I obtain is ~0.497193.

Cross-posting my Cross Validated answer to this cross-posted question:
https://stats.stackexchange.com/questions/303385/how-does-xgboost-lightgbm-evaluate-ndcg-metric-for-ranking/487487#487487
I happened across this myself, and finally dug into the code to figure it out.
The difference is the handling of a missing IDCG. Your code returns 0, while LightGBM is treating that case as a 1.
The following code produced matching results for me:
import numpy as np

def dcg_at_k(r, k):
    r = np.asfarray(r)[:k]
    if r.size:
        return np.sum(np.subtract(np.power(2, r), 1) / np.log2(np.arange(2, r.size + 2)))
    return 0.

def ndcg_at_k(r, k):
    idcg = dcg_at_k(sorted(r, reverse=True), k)
    if not idcg:
        return 1.  # CHANGE THIS
    return dcg_at_k(r, k) / idcg

I think the problem is caused by data in the same query that have the same label.
In that case, both XGBoost and LightGBM will produce an ndcg of 1 for that query.
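A quick way to see this with the corrected ndcg_at_k above (made-up relevance lists):

print(ndcg_at_k([0.0, 0.0, 0.0, 0.0], 10))  # 1.0 - all-zero labels, ideal DCG is 0, treated as 1
print(ndcg_at_k([2.0, 2.0, 2.0], 10))       # 1.0 - identical labels, any order is already ideal
print(ndcg_at_k([0.0, 2.0, 1.0], 10))       # < 1.0 - mixed labels depend on the predicted order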

Related

Calculating length of Polyline using a loop? python

I need to calculate the length of this poly line with the following coordinates formatted this exact way:
coords = [[1.0, 1.0], [1.5, 2.0], [2.2, 2.4], [3.0, 3.2], [4.0, 3.6], [4.5, 3.5], [4.8, 3.2], [5.2, 2.8], [5.6, 2.0],
[6.5, 1.2]]
using this distance formula: d = √((x2 − x1)² + (y2 − y1)²).
Our lab wants us to use a loop, and we should be able to use the same code on other sets of coordinates. I am lost on where to start.
Something like this could work?
Define a function for the Euclidean distance:
import math

def distance(x1, x2, y1, y2):
    return math.sqrt((x2 - x1)**2 + (y2 - y1)**2)
Then, another function which receives coords as input and gives back the sum of the point-to-point Euclidean distances:
def compute_poly_line_length(coords):
    distances = []
    for i in range(len(coords) - 1):
        current_point = coords[i]
        next_point = coords[i + 1]
        distances.append(distance(current_point[0], next_point[0], current_point[1], next_point[1]))
    return sum(distances)
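Putting it together with the coordinates from the question (the exact total may differ slightly in the last decimals):

coords = [[1.0, 1.0], [1.5, 2.0], [2.2, 2.4], [3.0, 3.2], [4.0, 3.6],
          [4.5, 3.5], [4.8, 3.2], [5.2, 2.8], [5.6, 2.0], [6.5, 1.2]]
print(compute_poly_line_length(coords))  # roughly 7.73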

Round up and round down of a given number to multiples of a floating point number in python

I am trying to extract the round up and round down of given values to the nearest 0.25 multiples. I have tried using the round, ceil and floor functions but am not getting the desired output. Is there any other function I can use to achieve the expected output shown in the code snippet below?
import math

x = [0.14, 2.31, 1.56, 1.98, 0.12, 0.11, 0.13, 0.25]
lup = []
ldown = []
base = 0.25
for i in x:
    x1 = base * (round(float(i / base)))
    x2 = base * (math.floor(float(i / base)))
    lup.append(x1)
    ldown.append(x2)
    # print(F"value:{i},fraction:{float(i/base)},fround:{round(float(i/base))},lup:{x1},ldown:{x2}")
print(F"lup={lup}")
print(F"ldown={ldown}")
expected output:
lup = [0.25,2.5,1.75,2.0,0.25,0.25,0.25]
ldown=[0,2.25,1.5,1.75,0.0,0.0,0.0,0.25]
obtained output:
lup=[0.25, 2.25, 1.5, 2.0, 0.0, 0.0, 0.25, 0.25]
ldown=[0.0, 2.25, 1.5, 1.75, 0.0, 0.0, 0.0, 0.25]
Use math.ceil (the opposite of math.floor) instead of round:
x1 = base * (math.ceil(float(i / base)))
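For completeness, a minimal version of the whole loop with that single change (the printed values are what I would expect, modulo float formatting):

import math

x = [0.14, 2.31, 1.56, 1.98, 0.12, 0.11, 0.13, 0.25]
base = 0.25
lup = [base * math.ceil(i / base) for i in x]
ldown = [base * math.floor(i / base) for i in x]
print(lup)    # [0.25, 2.5, 1.75, 2.0, 0.25, 0.25, 0.25, 0.25]
print(ldown)  # [0.0, 2.25, 1.5, 1.75, 0.0, 0.0, 0.0, 0.25]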

H5PY - How to store many 2D arrays of different dimensions

I would like to organize my collected data (from computer simulations) into a hdf5 file using Python.
I measured positions and velocities [x,y,z,vx,vy,vz] of all atoms within a certain space region over many time steps. The number of atoms, of course, varies from time step to time step.
A minimal example could look as follows:
[
[ [x1,y1,z1,vx1,vy1,vz1], [x2,y2,z2,vx2,vy2,vz2] ],
[ [x1,y1,z1,vx1,vy1,vz1], [x2,y2,z2,vx2,vy2,vz2], [x3,y3,z3,vx3,vy3,vz3] ]
]
(2 time steps,
first time step: 2 atoms,
second time step: 3 atoms)
My idea was to create a hdf5 dataset within Python which stores all the information. At each time step it should store a 2d array of alls positions/velocities of all atoms, i.e.
dataset[0] = [ [x1,y1,z1,vx1,vy1,vz1], [x2,y2,z2,vx2,vy2,vz2] ]
dataset[1] = [ [x1,y1,z1,vx1,vy1,vz1], [x2,y2,z2,vx2,vy2,vz2], [x3,y3,z3,vx3,vy3,vz3] ].
The idea is clear, I think. However, I struggle with defining the correct data type for a dataset with varying array lengths.
My code looks like this:
import numpy as np
import h5py
file = h5py.File ('file.h5','w')
columnNo = 6
rowtype = np.dtype("%sfloat32" % columnNo)
dt = h5py.special_dtype( vlen=np.dtype(rowtype) )
dataset = file.create_dataset("dset", (2,), dtype=dt)
print dataset.value
testarray = np.array([[1.,2.,3.,2.,3.,4.],[1.,2.,3.,2.,3.,4.]])
print testarray
dataset[0] = testarray
print dataset[0]
This, however, does not work. When I run the script I get the error message "AttributeError: 'float' object has no attribute 'dtype'."
It seems that my defined dtype is wrong.
Does anybody see how it should be defined correctly?
Thanks very much,
Sven
The error in your case is buried, though it is clear it occurs when trying to assign the testarray to the dataset:
Traceback (most recent call last):
File "stack41465480.py", line 26, in <module>
dataset[0] = testarray
File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper (/build/h5py-GhwtGD/h5py-2.6.0/h5py/_objects.c:2577)
...
File "h5py/_conv.pyx", line 712, in h5py._conv.ndarray2vlen (/build/h5py-GhwtGD/h5py-2.6.0/h5py/_conv.c:6171)
AttributeError: 'float' object has no attribute 'dtype'
I'm not skilled with special_dtype and vlen, but I was able to write numpy structured arrays to h5py.
import numpy as np
import h5py
file = h5py.File ('file.h5','w')
columnNo = 6
# rowtype = np.dtype("%sfloat32" % columnNo)
rowtype = np.dtype([('f0', '<f4',(6,))])
dt = h5py.special_dtype( vlen=np.dtype(rowtype) )
print('rowtype',rowtype)
print('dt',dt)
dataset = file.create_dataset("dset", (2,), dtype=rowtype)
print('value')
print(dataset.value[0])
arr = np.ones((2,),dtype=rowtype)
print(repr(arr))
dataset[0] = arr[0]
print(dataset.value)
testarray = np.array([([1.,2.,3.,2.,3.,4.],),([2.,3.,4.,1.,2.,3.],)], dtype=rowtype)
print(repr(testarray))
dataset[1] = testarray[1]
print(dataset.value)
print(dataset.value['f0'])
producing
1316:~/mypy$ python3 stack41465480.py
rowtype [('f0', '<f4', (6,))]
dt object
value
([0.0, 0.0, 0.0, 0.0, 0.0, 0.0],)
array([([1.0, 1.0, 1.0, 1.0, 1.0, 1.0],), ([1.0, 1.0, 1.0, 1.0, 1.0, 1.0],)],
dtype=[('f0', '<f4', (6,))])
[([1.0, 1.0, 1.0, 1.0, 1.0, 1.0],) ([0.0, 0.0, 0.0, 0.0, 0.0, 0.0],)]
array([([1.0, 2.0, 3.0, 2.0, 3.0, 4.0],), ([2.0, 3.0, 4.0, 1.0, 2.0, 3.0],)],
dtype=[('f0', '<f4', (6,))])
[([1.0, 1.0, 1.0, 1.0, 1.0, 1.0],) ([2.0, 3.0, 4.0, 1.0, 2.0, 3.0],)]
[[ 1. 1. 1. 1. 1. 1.]
[ 2. 3. 4. 1. 2. 3.]]
Thanks for the quick answer. It helped a lot.
If I now simply change the data type of the data set to
dtype = dt,
I get what I would like to have.
Here, the Python code (for completeness):
import numpy as np
import h5py
file = h5py.File ('file.h5','w')
columnNo = 6
rowtype = np.dtype([('f0', '<f4',(6,))])
dt = h5py.special_dtype( vlen=np.dtype(rowtype) )
print('rowtype',rowtype)
print('dt',dt)
dataset = file.create_dataset("dset", (2,), dtype=dt)
# print('value')
# print(dataset.value[0])
arr = np.ones((3,),dtype=rowtype)
# print(repr(arr))
dataset[0] = arr
# print(dataset.value)
testarray = np.array([([1.,2.,3.,2.,3.,4.],),([2.,3.,4.,1.,2.,3.],)], dtype=rowtype)
# print(repr(testarray))
dataset[1] = testarray
print(dataset.value)
for i in range(2): print dataset[i]
And the corresponding output reads:
('rowtype', dtype([('f0', '<f4', (6,))]))
('dt', dtype('O'))
[ array([([1.0, 1.0, 1.0, 1.0, 1.0, 1.0],),
([1.0, 1.0, 1.0, 1.0, 1.0, 1.0],), ([1.0, 1.0, 1.0, 1.0, 1.0, 1.0],)],
dtype=[('f0', '<f4', (6,))])
array([([1.0, 2.0, 3.0, 2.0, 3.0, 4.0],), ([2.0, 3.0, 4.0, 1.0, 2.0, 3.0],)],
dtype=[('f0', '<f4', (6,))])]
[([1.0, 1.0, 1.0, 1.0, 1.0, 1.0],) ([1.0, 1.0, 1.0, 1.0, 1.0, 1.0],)
([1.0, 1.0, 1.0, 1.0, 1.0, 1.0],)]
[([1.0, 2.0, 3.0, 2.0, 3.0, 4.0],) ([2.0, 3.0, 4.0, 1.0, 2.0, 3.0],)]
Just to get it right: The problem in my original code was a bad definition of my rowtype data structure, right?
Best,
Sven
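As a side note, reading the data back can be done along these lines (a sketch, assuming each vlen element comes back as a structured array, which is what the output above suggests):

import h5py

# sketch: read back each variable-length time step and view it as an (n_atoms, 6) float array
with h5py.File('file.h5', 'r') as f:
    dset = f['dset']
    for i in range(dset.shape[0]):
        step = dset[i]        # structured array with field 'f0', one entry per atom
        coords = step['f0']   # plain (n_atoms, 6) float32 array
        print(i, coords.shape)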

Similarity Measure/Matrix for data (recommender system)- Python

I am new to machine learning and am trying to try out the following problem.
Input is 2 arrays of descriptions with the same length, and output is an array of similarity scores of the first string from the first array compared to the first string in the second array, etc.
Each item in the (numpy) array is a description string. Can you write a function that finds out how similar two strings are by counting how many identical and co-occurring word IDs there are, and assigns a score (one possible weight could be based on the frequency of co-occurrence vs the sum of the frequencies of the individual word IDs)? Then apply the function to the two arrays to get an array of scores.
Please also let me know if there are other approaches you would want to consider as well.
Thanks!
Data:
array(['0/1/2/3/4/5/6/5/7/8/9/3/10', '11/12/13/14/15/15/16/17/12',
'18/19/20/21/22/23/24/25',
'26/27/28/29/30/31/32/33/34/35/36/37/38/39/33/34/40/41',
'5/42/43/15/44/45/46/47/48/26/49/50/51/52/49/53/54/51/55/56/22',
'57/58/59/60/61/49/62/23/57/58/63/57/58', '64/65/66/63/67/68/69',
'70/71/72/73/74/75/76/77',
'78/79/80/81/82/83/84/85/86/87/88/89/90/91',
'33/34/92/93/94/95/85/96/97/98/99/60/85/86/100/101/102/103',
'104/105/106/107/108/109/110/86/107/111/112/113/5/114/110/113/115/116',
'117/118/119/120/121/12/122/123/124/125',
'14/126/127/128/122/129/130/131/132/29/54/29/129/49/3/133/134/135/136',
'137/138/139/140/141/142',
'143/144/145/146/147/148/149/150/151/152/4/153/154/155/156/157/158/128/159',
'160/161/162/163/131/2/164/165/166/167/168/169/49/170/109/171',
'172/173/174/175/176/177/73/178/104/179/180/179/181/173',
'182/144/183/179/73',
'184/163/68/185/163/8/186/187/188/54/189/190/191',
'181/192/0/1/193/194/22/195',
'113/196/197/198/68/199/68/200/201/202/203/201',
'204/205/206/207/208/209/68/200',
'163/210/211/122/212/213/214/215/216/217/100/101/160/139/218/76/179/219',
'220/221/222/223/5/68/224/225/54/225/226/227/5/221/222/223',
'214/228/5/6/5/215/228/228/229',
'230/231/232/233/122/215/128/214/128/234/234',
'235/236/191/237/92/93/238/239',
'13/14/44/44/240/241/242/49/54/243/244/245/55/56',
'220/21/246/38/247/201/248/73/160/249/250/203/201',
'214/49/251/252/253/254/255/256/257/258'],
dtype='|S127')
array(['151/308/309/310/311/215/312/160/313/214/49/12',
'314/315/316/275/317/42/318/319/320/212/49/170/179/29/54/29/321/322/323',
'324/325/62/220/326/194/327/328/218/76/241/329',
'330/29/22/103/331/314/68/80/49',
'78/332/85/96/97/227/333/4/334/188',
'57/335/336/34/187/337/21/338/212/213/339/340',
'341/342/167/343/8/254/154/61/344',
'2/292/345/346/42/347/348/348/100/349/202/161/263',
'283/39/312/350/26/351', '352/353/33/34/144/218/73/354/355',
'137/356/357/358/357/359/22/73/170/87/88/78/123/360/361/53/362',
'23/363/10/364/289/68/123/354/355',
'188/28/365/149/366/98/367/368/369/370/371/372/368',
'373/155/33/34/374/25/113/73', '104/375/81/82/168/169/81/82/18/19',
'179/376/377/378/179/87/88/379/20',
'380/85/381/333/382/215/128/383/384', '385/129/386/387/388',
'389/280/26/27/390/391/302/392/393/165/394/254/302/214/217/395/396',
'397/398/291/140/399/211/158/27/400', '401/402/92/93/68/80',
'77/129/183/265/403/404/405/406/60/407/162/408/409/410/411/412/413/156',
'129/295/90/259/38/39/119/414/415/416/14/318/417/418',
'419/420/421/422/423/23/424/241/421/425/58',
'426/244/427/5/428/49/76/429/430/431',
'257/432/433/167/100/101/434/435/436', '437/167/438/344/356/170',
'439/440/441/442/192/443/68/80/444/445/111', '446/312/23/447/448',
'385/129/218/449/450/451/22/452/125/129/453/212/128/454/455/456/457/377'],
dtype='|S127')
The following code should give you what you need in Python 3.x:
import numpy as np
from collections import Counter

def jaccardSim(c1, c2):
    cU = c1 | c2
    cI = c1 & c2
    sim = sum(cI.values()) / sum(cU.values())
    return sim

def byteArraySim(b1, b2):
    cA = [Counter(b1[i].decode(encoding="utf-8", errors="strict").split("/"))
          for i in range(len(b1))]
    cB = [Counter(b2[i].decode(encoding="utf-8", errors="strict").split("/"))
          for i in range(len(b2))]
    # Assuming both arrays have the same length
    cSim = [jaccardSim(cA[i], cB[i]) for i in range(len(b1))]
    return cSim  # Array of similarities
The Jaccard similarity score is used in this implementation. You may use other scores, such as cosine or Hamming similarity, to your liking.
Assuming that the arrays are stored in variables a and b, calling byteArraySim(a, b) outputs the following similarity scores:
[0.0,
0.0,
0.0,
0.038461538461538464,
0.0,
0.041666666666666664,
0.0,
0.0,
0.0,
0.08,
0.0,
0.05555555555555555,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.058823529411764705,
0.0,
0.0,
0.0,
0.05555555555555555,
0.0,
0.0,
0.0,
0.0,
0.0]
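If you prefer cosine over Jaccard, a drop-in scoring function on the same Counter representation could look like this (a sketch; cosineSim is not part of the answer above):

import math
from collections import Counter

def cosineSim(c1, c2):
    # cosine similarity between two Counters treated as sparse count vectors
    dot = sum(c1[t] * c2[t] for t in set(c1) & set(c2))
    norm1 = math.sqrt(sum(v * v for v in c1.values()))
    norm2 = math.sqrt(sum(v * v for v in c2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)

print(cosineSim(Counter("0/1/2/3".split("/")), Counter("2/3/4/5".split("/"))))  # 0.5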

Input....and every element must be continuous Error using sklearn MultinominalHMM

Trying to create a left-right discrete HMM in sklearn to recognize words from recognized characters. The symbol set is " " plus the 26 letters, for 27 total symbols.
import numpy as np
from sklearn import hmm
# alphabet is symbols
symbols = [' ','a','b','c','d','e','f','g','h','i','j', #0-10
'k','l','m','n','o','p','q','r','s','t', #11-20
'u','v','w','x','y','z'] #21-26
num_symbols = len(symbols)
# words up to 6 letters
n_states = 6
obsONE = np.array([ [0,0,15,14,5,0], # __one_
[15,14,5,0,0,0], # one___
[0,0,0,15,14,5], # ___one
[0,15,14,5,0,0], # _one__
[0,0,16,14,5,0], # __pne_
[15,14,3,0,0,0], # onc___
[0,0,0,15,13,5], # ___ome
[0,15,14,5,0,0], # _one__
[0,0,15,14,5,0], # __one_
[15,14,5,0,10,15], # one_jo
[1,14,0,15,14,5], # an_one
[0,15,14,5,0,16], # _one_p
[20,0,15,14,5,0], # t_one_
[15,14,5,0,10,15], # one_jo
[21,20,0,15,14,5], # ut_one
[0,15,14,5,0,20], # _one_t
[21,0,15,14,5,0], # u_one_
[15,14,5,0,10,15], # one_jo
[0,0,0,15,14,5], # an_one
[0,15,14,5,0,26], # _one_z
[5,20,0,15,14,5] ])
pi = np.array([1.0, 0.0, 0.0, 0.0, 0.0, 0.0]) # initial state is the left one always
A = np.array([[0.0, 1.0, 0.0, 0.0, 0.0, 0.0], # node 1 goes to node 2
[0.0, 0.5, 0.5, 0.0, 0.0, 0.0], # node 2 can self loop or goto 3
[0.0, 0.0, 0.5, 0.5, 0.0, 0.0], # node 3 can self loop or goto 4
[0.0, 0.0, 0.0, 0.5, 0.5, 0.0], # node 4 can self loop or goto 5
[0.0, 0.0, 0.0, 0.0, 0.5, 0.5], # node 5 can self loop or goto 6
[1.0, 0.0, 0.0, 0.0, 0.0, 0.0]]) # node 6 goes to node 1
model = hmm.MultinomialHMM(n_components=n_states,
startprob=pi, # this is the start matrix, pi
transmat=A, # this is the transition matrix, A
params='e', # update e in during training (aka B)
init_params='ste') # initialize with s,t,e
model.n_symbols = num_symbols
model.fit(obsONE)
But I get ValueError: Input must be both positive integer array and every element must be continuous.
The code seems to want the observations encoded directly as [0,1,2,3,4,5].
How should I set this up to get the HMM model that I want?
I faced the same problem. It seems to me that the observation sequence that is input doesn't contain some characters from the vocabulary. So instead of assigning numbers to them statically, you can assign numbers to characters after finding out what characters are present in the observation sequence.
For example, suppose
Word = 'apzaqb'
Use
Symbols = ['a','b','p','q','z'] for numbering,
i.e. ObsOne = np.array([0,2,4,0,3,1])
instead of
Symbols = [' ','a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z'] for numbering,
i.e. ObsOne = np.array([1,16,26,1,17,2])
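A small sketch of that idea (the helper name encode_word is mine, not from the answer):

import numpy as np

def encode_word(word):
    # build the symbol set from the characters actually present, then map them to indices
    symbols = sorted(set(word))
    mapping = {c: i for i, c in enumerate(symbols)}
    return symbols, np.array([mapping[c] for c in word])

symbols, obs = encode_word('apzaqb')
print(symbols)  # ['a', 'b', 'p', 'q', 'z']
print(obs)      # [0 2 4 0 3 1]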
