python xgboost: Is it possible to merge/join/combine 2 QuantileDMatrix objects?

Suppose I have
import xgboost as xgb
import numpy as np
COLS = 64
ROWS_PER_BATCH = 1000  # data is split by rows
BATCHES = 2
rng = np.random.RandomState(1980)
data = [rng.randn(ROWS_PER_BATCH, COLS)] * BATCHES
q1 = xgb.QuantileDMatrix(data[0])
q2 = xgb.QuantileDMatrix(data[1])
Is it possible to merge/combine q1 and q2 so that I can then do something like:
q = xgb.merge(q1, q2)  # to be clear, this merge function does not exist
res = xgb.train({"tree_method": "hist"}, q)
I've been looking through the codebase and struggling to understand whether the approximate histogram method used allows for merge-like operations.
Thanks in advance!
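There does not appear to be a public merge function for QuantileDMatrix, but one possible workaround (a sketch, assuming a recent XGBoost release in which QuantileDMatrix accepts a DataIter) is to build the quantile sketch over all batches at once, so no merge is needed afterwards:
import xgboost as xgb
import numpy as np

class BatchIter(xgb.DataIter):
    # hypothetical helper: yields in-memory batches one at a time
    def __init__(self, batches):
        self._batches = batches
        self._it = 0
        super().__init__()
    def next(self, input_data):
        if self._it == len(self._batches):
            return 0  # no more batches
        input_data(data=self._batches[self._it])
        self._it += 1
        return 1
    def reset(self):
        self._it = 0

rng = np.random.RandomState(1980)
data = [rng.randn(1000, 64) for _ in range(2)]
q = xgb.QuantileDMatrix(BatchIter(data))  # the sketch is built over both batches together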

Related

Random Sample From Data frame and remains

How do I select the remainder of a data frame after a random selection of data?
This gives me 80% of the data, but I want the remaining 20% as well:
df.sample(frac=0.8)
You can use:
df_sample = df.sample(frac=0.8)
and then:
df_remains = df[~df.index.isin(df_sample.index)]
Since you also have NumPy installed (it is a pandas dependency), you can do something like this:
import numpy as np
p = .8
msk = np.random.rand(len(df)) < p
sample = df[msk]
remains = df[~msk]
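As a quick sanity check (a small made-up frame, just to illustrate), both approaches partition the original rows; note that the random-mask version gives roughly 80% rather than exactly 80%:
import pandas as pd

df = pd.DataFrame({'x': range(10)})
df_sample = df.sample(frac=0.8, random_state=0)   # random_state only for reproducibility
df_remains = df[~df.index.isin(df_sample.index)]
assert len(df_sample) + len(df_remains) == len(df)
assert df_sample.index.intersection(df_remains.index).empty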

Selecting specific features based on correlation values

I am using the Housing train.csv data from Kaggle to run a prediction.
https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data?select=train.csv
I am trying to generate a correlation matrix and keep only the features whose correlation with SalePrice is between 0.5 and 0.9. I tried to use this function to filter some of them, but it only removes features whose correlation is above 0.9.
How would I update this function to only keep those specific features that I need to generate a correlation heat map?
data = train
corr = data.corr()
columns = np.full((corr.shape[0],), True, dtype=bool)
for i in range(corr.shape[0]):
    for j in range(i+1, corr.shape[0]):
        if corr.iloc[i,j] >= 0.9:
            if columns[j]:
                columns[j] = False
selected_columns = data.columns[columns]
data = data[selected_columns]
import pandas as pd
data = pd.read_csv('train.csv')
col = data.columns
c = [i for i in col if data[i].dtypes=='int64' or data[i].dtypes=='float64'] # dropping columns as dtype == object
main_col = ['SalePrice'] # column with which we have to compare correlation
corr_saleprice = data.corr().filter(main_col).drop(main_col)
c1 =(corr_saleprice['SalePrice']>=0.5) & (corr_saleprice['SalePrice']<=0.9)
c2 =(corr_saleprice['SalePrice']>=-0.9) & (corr_saleprice['SalePrice']<=-0.5)
req_index= list(corr_saleprice[c1 | c2].index) # selecting column with given criteria
#req_index.append('SalePrice') #if you want SalePrice column in your final dataframe too , uncomment this line
data = data[req_index]
data
Also, using for loops is not very efficient; a direct vectorized implementation is preferable. I hope this is what you want!
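For instance, the c1/c2 conditions above can be collapsed into a single vectorized selection (a sketch reusing the corr_saleprice frame computed above):
# keep features whose absolute correlation with SalePrice lies in [0.5, 0.9]
mask = corr_saleprice['SalePrice'].abs().between(0.5, 0.9)
req_index = list(corr_saleprice[mask].index)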
For generating the heatmap, you can use the following code:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
a = data.corr()
mask = np.triu(np.ones_like(a, dtype=bool))
plt.figure(figsize=(10, 10))
_ = sns.heatmap(a, cmap=sns.diverging_palette(250, 20, n=250), square=True, mask=mask, annot=True, center=0.5)

How to compute the correlations of long format dataframe with pandas?

I have a dataframe with 3 columns.
UserId | ItemId | Rating
(where Rating is the rating a User gave to an Item; it's an np.float16. The two Ids are np.int32.)
How do you best compute correlations between items using Python pandas?
My take is to first pivot the table (to wide format) and then apply .corr():
df = df.pivot(index='UserId', columns='ItemId', values='Rating')
df.corr()
It works on small datasets, but not on big ones.
That first step creates a big matrix that is mostly full of missing values. It's quite RAM-intensive and I can't run it with bigger dataframes.
Isn't there a simpler way to compute the correlations directly on the long dataset, without pivoting?
(I looked into groupby, but that seems to only split the dataframe, which is not what I'm looking for.)
EDIT: oversimplified data and working pivot code
import pandas as pd
import numpy as np
d = {'UserId': [1,2,3, 1,2,3, 1,2,3],
'ItemId': [1,1,1, 2,2,2, 3,3,3],
'Rating': [1.1,4.5,7.1, 5.5,3.1,5.5, 1.1,np.nan,2.2]}
df = pd.DataFrame(data=d)
df = df.astype(dtype={'UserId': np.int32, 'ItemId': np.int32, 'Rating': np.float32})
print(df.info())
pivot = df.pivot(index='UserId', columns='ItemId', values='Rating')
print('')
print(pivot)
corr = pivot.corr()
print('')
print(corr)
EDIT2: Large random data generator
def randDf(size=100):
    ## MAKE RANDOM DATAFRAME, df =======================
    import numpy as np
    import pandas as pd
    import random
    import math
    dict_for_df = {}
    for i in ('UserId', 'ItemId', 'Rating'):
        dict_for_df[i] = {}
        for j in range(size):
            if i == 'Rating':
                val = round(random.random() * 5, 1)
            else:
                val = round(random.random() * math.sqrt(size / 2))
            dict_for_df[i][j] = val  # store in a dict
    # print(dict_for_df)
    df = pd.DataFrame(dict_for_df)  # after the loop, convert the dict to a dataframe
    # print(df.head())
    df = df.astype(dtype={'UserId': np.int32, 'ItemId': np.int32, 'Rating': np.float32})
    # df = df.astype(dtype={'UserId': np.int64, 'ItemId': np.int64, 'Rating': np.float64})
    ## remove doubles -----
    df.drop_duplicates(subset=['UserId', 'ItemId'], keep='first', inplace=True)
    ## show -----
    print(df.info())
    print(df.head())
    return df

# =======================
df = randDf()
# =======================
df = randDf()
I had another go, and have something that gets exactly the same correlation numbers as your method without using pivot, but is much slower. I can't say whether it uses less or more memory:
from scipy.stats import pearsonr
import itertools
import pandas as pd
import numpy as np
d = []
itemids = list(set(df['ItemId']))
pairsofitems = list(itertools.combinations(itemids,2))
for itempair in pairsofitems:
    a = df[df['ItemId'] == itempair[0]][['Rating', 'UserId']]
    b = df[df['ItemId'] == itempair[1]][['Rating', 'UserId']]
    z = np.ones(len(set(df.UserId)), dtype=int)
    z = z * np.nan
    z[a.UserId.values] = a.Rating.values
    w = np.ones(len(set(df.UserId)), dtype=int)
    w = w * np.nan
    w[b.UserId.values] = b.Rating.values
    bad = ~np.logical_or(np.isnan(w), np.isnan(z))
    z = np.compress(bad, z)
    w = np.compress(bad, w)
    d.append({'firstitem': itempair[0],
              'seconditem': itempair[1],
              'correlation': pearsonr(z, w)[0]})
df_out = pd.DataFrame(d, columns=['firstitem', 'seconditem', 'correlation'])
This was of help in working out how to handle the NaNs before taking the correlation.
The slicing in the two lines after the start of the for loop takes time. I think, though, it may have potential if the bottlenecks could be fixed.
Yes, there is some repetition in there with the z and w variables; that could be put in a function.
Some explanation of what it does:
find all combinations of pairs within your items
organise an "x" and "y" set of points for UserId / Rating, where any point pair in which one of the two is missing (NaN) is dropped. I think of a scatter plot, with the correlation being how well a straight line fits through it.
run a Pearson correlation on this x-y pair
put the ItemIds of each pair and the correlation into a dataframe
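A possible simplification of the pairing step (a sketch, not benchmarked): instead of building dense NaN-filled arrays indexed by UserId, align the two rating series with an inner merge on UserId. That avoids the assumption that UserIds run from 0 to n-1 and drops unmatched users automatically:
import itertools
import pandas as pd

def pair_corr(df, item_a, item_b):
    # ratings of the two items, keyed by user
    a = df.loc[df['ItemId'] == item_a, ['UserId', 'Rating']]
    b = df.loc[df['ItemId'] == item_b, ['UserId', 'Rating']]
    merged = a.merge(b, on='UserId', suffixes=('_a', '_b')).dropna()
    return merged['Rating_a'].corr(merged['Rating_b'])

pairs = itertools.combinations(sorted(set(df['ItemId'])), 2)
d = [{'firstitem': i, 'seconditem': j, 'correlation': pair_corr(df, i, j)}
     for i, j in pairs]
df_out = pd.DataFrame(d, columns=['firstitem', 'seconditem', 'correlation'])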

Iterating through input-scenarios and storing results as nested arrays in Python

I am trying to run my model for all three input scenarios using a loop, instead of copy-pasting the script three times and changing the input data manually. I have three arrays of input data and would like to store the results (also arrays of the same length) in separate nested arrays within the same variable. Currently I only know how to append the results, but that is incorrect; I want to store the results of the different scenario runs as separate elements within the same variable.
import numpy as np
# Scenarios
years = np.arange(50)
sc0 = np.arange(50)
sc1 = np.arange(50)+100
sc2 = np.arange(50)+200
scenarios = [sc0, sc1, sc2]
results = []
# Model computes something
for sc in range(3):
    for t in years:
        outcome = scenarios[sc][t] / 10
        results.append(outcome)
In a nutshell, the solution should allow me to access the results for all model runs using results[0], results[1], and results[2].
I have created a new list, subresults, which is reset to [] for each scenario. It is then appended to the list results after every outcome has been calculated for that scenario.
import numpy as np
# Scenarios
years = np.arange(50)
sc0 = np.arange(50)
sc1 = np.arange(50)+100
sc2 = np.arange(50)+200
scenarios = [sc0, sc1, sc2]
results = []
# Model computes something
for sc in range(3):
    subresults = []
    for t in years:
        outcome = scenarios[sc][t] / 10
        subresults.append(outcome)
    results.append(subresults)
Then you access your results using results[0], results[1], and results[2].
A comprehension will do it as well:
resultsets = [[sc[t]/10 for t in years] for sc in scenarios]
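Since the scenarios are already NumPy arrays of equal length, the same toy model can also be vectorized outright (a sketch of the division-by-10 computation only):
import numpy as np

scenarios = np.array([np.arange(50), np.arange(50) + 100, np.arange(50) + 200])
results = scenarios / 10   # shape (3, 50); results[0], results[1], results[2] are the three runs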

How can I increase speed/performance with Scikit-learn regression and Pandas?

I am playing with the excellent scikit-learn today. I'm forming the x's out of a Panel sliced on the minor_axis and the y's out of a DataFrame sliced on columns. At the moment I'm doing endless iterations; do any .apply() masters out there have an idea how to speed this up?
from pandas import *
import numpy as np
from sklearn import linear_model
np.random.seed(247)
x = Panel(np.random.rand(3,25,10))
y = DataFrame(np.random.rand(25,5))
r2 = Series(index=y.columns)
for i in y.columns:
    X = x.ix[:,:,i]
    Y = y.ix[:,i]
    r2.ix[i] = linear_model.LinearRegression().fit(X,Y).score(X,Y)
In [325]: r2
Out[325]:
0 0.061945
1 0.091734
2 0.004635
3 0.015835
4 0.027906
dtype: float64
My idea was to apply this function (or something similar) column-wise. I have played with .apply(), but because it's a double (or triple) chained function call, i.e. f1().f2(x,y) or f1().f2(x,y).f3(x,y), it gives me an error. Any ideas would be greatly appreciated; I think this would be a very useful bit of code to have out there!
LW
You could do your calculations in parallel. This doesn't really make your code "better", but it would definitely make things faster. Something like ...
Code:
#!/usr/bin/env python
import pandas as pd
import numpy as np
from sklearn import linear_model
from multiprocessing import Pool
import time
np.random.seed(247)
x = pd.Panel(np.random.rand(3, 25, 2000000))
y = pd.DataFrame(np.random.rand(25, 1000000))
def main(i):
    X = x.ix[:,:,i]
    Y = y.ix[:,i]
    r2 = linear_model.LinearRegression().fit(X, Y).score(X, Y)
    return r2

if __name__ == '__main__':
    start_time = time.time()
    p = Pool()
    result = p.map(main, range(1000000))
    print(result[:2])  # print the first 2 r2's
    end_time = time.time()
    print('Iterations took %f seconds.' % (end_time - start_time))
Output:
[0.07197, 0.24436]
"Iterations took 159.226 seconds."
I ran a million regressions and as you can see it took ~2.5 minutes. This will vary based on the number of cores you have. result will be a list of your scores so you can easily reproduce the r2 Series in your example. Good Luck!
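If managing a Pool by hand feels heavy, a similar speed-up can be had with joblib (a sketch, assuming joblib is installed; it reuses the main() function defined above):
from joblib import Parallel, delayed

# n_jobs=-1 uses all available cores
scores = Parallel(n_jobs=-1)(delayed(main)(i) for i in range(y.shape[1]))
r2 = pd.Series(scores, index=y.columns)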
