I am working on predicting the probability of a binary response (yes or no, coded 1 or 0) for 60,000 dispute claims, each with a unique reference ID. I use the first 3/4 of the data as the training set (X_train, y_train) and a logistic regression classifier to predict the response probability for the last 1/4, the test set (X_test). I would like to turn the output into a series indexed by reference ID, so that it looks like this:
reference_id
184932 0.531842
185362 0.401958
185361 0.105928
185338 0.018572
...
276499 0.208567
276500 0.818759
269851 0.018528
Name: response, dtype: float32
I implemented the following Python code:
y_score_lr = LogisticRegression(C=10).fit(X_train, y_train).predict_proba(X_test)[:,1]
y_proba = y_score_lr
The result is a NumPy array like this:
array([ 0.05225495, 0.00522493, 0.07369773, ..., 0.06994582, 0.06995239, 0.12659022])
But I am not sure if this array actually matches the corresponding reference_id in the original X_test data frame, and I haven't figured out how to convert it into an indexed "series" like the one I mentioned at the beginning of this post.
I would appreciate it if someone could point me to a shortcut to achieve this.
I also tried using
y_score_lr = LogisticRegression(C=10).fit(X_train, y_train).predict_proba(X_test)[:,1]
y_proba = y_score_lr.tolist()
to convert the array into a list, but still could not make it into the desired series-type output with 'reference_id' indexed.
Thank you.
Sincerely,
First of all, yes, it matches the values of X_test: the first row corresponds to the first value in the y_proba array.
Secondly, there are several ways you can approach this problem.
One possible solution is the following, assuming you want a pandas.Series:
import pandas as pd
import numpy as np
y_proba_indexed = pd.Series(
data=y_proba, index=X_test['reference_id'], name='response', dtype=np.float32)
print(y_proba_indexed)
This would give you something like this:
184932 0.531842
185362 0.401958
185361 0.105928
185338 0.018572
....
276499 0.208567
276500 0.818759
269851 0.018528
Name: response, dtype: float32
To access, for instance, the probability for reference_id = 185338, you can use y_proba_indexed.loc[[185338]]. The output will be:
185338 0.018572
Name: response, dtype: float32
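As a self-contained sketch of the same idea, the snippet below trains on synthetic data and builds the indexed series. It assumes reference_id is the DataFrame index rather than a column (if it is a column, pass X_test['reference_id'] as the index instead). The feature names f1 to f3 and the ID range are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for the dispute data (assumption: reference_id is the index)
ids = pd.Index(np.arange(1000, 1200), name="reference_id")
X = pd.DataFrame(rng.normal(size=(200, 3)), index=ids, columns=["f1", "f2", "f3"])
noise = rng.normal(scale=0.5, size=200)
y = ((X["f1"] + noise) > 0).astype(int).rename("response")

# First 3/4 of the rows for training, last 1/4 for testing
split = int(len(X) * 0.75)
X_train, X_test = X.iloc[:split], X.iloc[split:]
y_train = y.iloc[:split]

clf = LogisticRegression(C=10).fit(X_train, y_train)

# predict_proba returns one row per row of X_test, in the same order,
# so the index of X_test can be attached directly
y_proba_indexed = pd.Series(
    clf.predict_proba(X_test)[:, 1],
    index=X_test.index,
    name="response",
    dtype=np.float32,
)
print(y_proba_indexed.head())
```

Because the series carries the index of X_test, a lookup such as y_proba_indexed.loc[[1150]] returns the probability for that ID without relying on row position.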
I currently have seismic data for 175 events with 3 traces for each event (traces are numpy arrays of seismic data). I have classification labels for whether the seismic data is an earthquake or not for each of those 175 samples. I'm looking to format my data into numpy arrays for modelling. I've tried placing it into a dataframe of numpy arrays with each column being a different trace, so the columns would be 'Trace one', 'Trace two', 'Trace three'. This did not work. I have tried lots of different methods of arranging the data to use with keras.
I'm now looking to create a numpy matrix for the data to go into and to then use for modelling.
I had thought that the shape may be (175,3,7501) as (#number of events, #number of traces,#number of samples in trace), however I then iterate through and try to add the three traces to the numpy matrix and have failed. I'm used to using dataframes and not numpy for inputting to Keras.
newrow = np.array([[trace_copy_1],[trace_copy_2],[trace_copy_3]])
data = np.vstack([data, newrow])
The data shape is (175,3,7510). The newrow shape is (3,1,7510) and does not allow me to add newrow to data.
The form in which I receive the data is in obspy streams and each stream has the 3 trace objects. With each trace object, it holds the trace data in numpy arrays and so I'm having to access and append those to a dataframe for modelling as obviously I can't feed a stream or trace object to keras model.
If I understand your data correctly, you can try one of the following methods:
If your data shape is (175, 3, 7510), define newrow as newrow = np.array([trace_copy_1,trace_copy_2,trace_copy_3]), with each trace_copy_x being a numpy array of shape (7510,).
Use the reshape function, either numpy.reshape(newrow, (3, 7510)) or newrow.reshape((3, 7510)).
If you're familiar with dataframes you can still use pandas dataframes by reducing the dimension of your data (you can for example add the different traces at the end of one another on the same row, something you often see when working with images). Here it could be something like pandas.DataFrame(data.reshape((175, 3*7510)))
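In addition, I recommend using numpy.concatenate instead of numpy.vstack, since it is more general. Note that newrow needs shape (1, 3, 7510), for example via newrow[np.newaxis], before it can be appended to data along the first axis.
I hope it works.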
Cheers
Thanks for the answers. I solved this by creating a NumPy array of the desired shape: (number of events, number of traces, number of samples in each trace).
I then created a new row. I then reshaped and added. Following this, I then split the data to remove the original data before I started appending my new data.
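import numpy as np
# placeholder array with the target shape (events, traces, samples)
data = np.zeros(shape=(175,3,7501))
# combine the three traces, then reshape to a single event of shape (1, 3, 7501)
newrow = np.array([[trace_copy_1],[trace_copy_2],[trace_copy_3]])
newrow = newrow.reshape((1,3,7501))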
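A minimal sketch of building the (events, traces, samples) array in one go with np.stack, which avoids the placeholder array and the later split. Random arrays stand in for the obspy trace data here, and all variable names are illustrative only.

```python
import numpy as np

n_events, n_traces, n_samples = 175, 3, 7501
rng = np.random.default_rng(0)

events = []
for _ in range(n_events):
    # Three 1-D arrays standing in for the trace data of one stream
    traces = [rng.normal(size=n_samples) for _ in range(n_traces)]
    # Stack the traces of one event into shape (3, n_samples)
    events.append(np.stack(traces))

# Stack all events along a new first axis: shape (175, 3, 7501)
data = np.stack(events)
print(data.shape)

# Appending one more event requires a leading axis of length 1
newrow = np.stack([rng.normal(size=n_samples) for _ in range(n_traces)])
extended = np.concatenate([data, newrow[np.newaxis]], axis=0)
print(extended.shape)
```

The resulting array can be passed to a Keras model as is, or transposed to (events, samples, traces) if the layers expect channels last.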
I'm new to Python and need some help with xarray.
I have two 3-dimensional data arrays (rlon, rlat, time) for future and past climate. I want to compute the Mann-Whitney U test for each grid point to analyse the significance of the temperature change in the future compared to the past. I already got the Mann-Whitney U test to work by selecting a time series from one grid point of the historical and the future data each. Example:
import numpy as np
import xarray as xr
import scipy.stats as sts
#selecting time period and grid point of past and future data
tp = fileHis['tas']
tf = fileFut['tas']
gridpoint_past=tp.sel(rlon=-6.375, rlat=1.375, time=slice('1999-01-01', '1999-01-31'))
gridpoint_future=tf.sel(rlon=-6.375, rlat=1.375, time=slice('2099-01-01', '2099-01-31'))
#Mann-Whitney U test
result=sts.mannwhitneyu(gridpoint_past, gridpoint_future, alternative='two-sided')
print('pvalue =',result[1])
Output:
pvalue = 0.05922372345359562
My problem now is that I need to do this for each grid point and each month and in the end I would like to have a data array with pvalues for each grid point and each month of a year.
I was thinking about looping through all rlat, rlon and months and running the Mann-Whitney U test for each, unless there is a better way to do it.
And how can I write the pvalues one by one into a new data array with the same rlat, rlon dimension?
I was trying this, but it does not work:
I created a data array pvalue_mon, which has the same rlat, rlon as tp and tf and has 12 months as time steps.
pvalue_mon.sel(rlon=-6.375, rlat=1.375, time=th.time.dt.month.isin([1])) = result[1]
SyntaxError: can't assign to function call
or this:
pvalue_mon.sel(rlon=-6.375, rlat=1.375, time=pvalue_mon.time.dt.month.isin([1])).update(result[1])
TypeError: 'numpy.float64' object is not iterable
How can I replace a single value of an existing variable?
Instead of using the .sel() function, try using .loc[ ] as described here:
http://xarray.pydata.org/en/stable/indexing.html#assigning-values-with-indexing
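A minimal sketch of label-based assignment with .loc. It assumes a hypothetical pvalue_mon with a 'month' dimension (1 to 12) instead of a time coordinate, a tiny made-up grid, and a placeholder p-value.

```python
import numpy as np
import xarray as xr

# Hypothetical result array: 12 months on a small rlat/rlon grid, filled with NaN
pvalue_mon = xr.DataArray(
    np.full((12, 2, 3), np.nan),
    dims=("month", "rlat", "rlon"),
    coords={
        "month": np.arange(1, 13),
        "rlat": [1.375, 1.625],
        "rlon": [-6.625, -6.375, -6.125],
    },
)

# Placeholder for the p-value returned by the test
p = 0.0592

# .sel() returns a new object and cannot be assigned to;
# .loc with a dict of labels writes into the existing array
pvalue_mon.loc[dict(month=1, rlat=1.375, rlon=-6.375)] = p

print(pvalue_mon.sel(month=1, rlat=1.375, rlon=-6.375).item())
```

Inside a loop over months and grid points, the same .loc assignment can be used to fill the array one p-value at a time.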
I'm trying to normalize data with missing (i.e. nan) values before processing it, using scikit-learn preprocessing.
Apparently, some scalers (e.g. StandardScaler) handle the missing values the way I want, by which I mean they normalize the existing values while keeping the nans, while others (e.g. Normalizer) just raise an error.
I've looked around and haven't found - how can I use the Normalizer with missing values, or replicate its behavior (with norm='l1' and norm='l2'; I need to test several normalization options) some other way?
from sklearn.preprocessing import Normalizer, StandardScaler
import numpy as np
data = np.array([0,1,2,np.nan, 3,4])
scaler = StandardScaler(with_mean=True, with_std=True)
scaler.fit_transform(data.reshape(-1,1))
normalizer = Normalizer(norm='l2')
normalizer.fit_transform(data.reshape(-1,1))
The problem with your request is that Normalizer operates in this fashion, according to the documentation:
Normalize samples individually to unit norm.
Each sample (i.e. each row of the data matrix) with at least one non
zero component is rescaled independently of other samples so that its
norm (l1 or l2) equals one (source here)
That means each row is rescaled to have unit norm. How should a missing value be handled? Ideally, it seems you don't want it to count towards the norm and you want the row to be normalized regardless of it, but the internal function check_array prevents this by raising an error.
You need to circumvent such a situation. The most reasonable way to do it is to:
first create a mask in order to record which elements were missing in your array
create a response array filled with missing values
apply the Normalizer to your array after selecting only the valid entries
record on your response array the normalized values based on their original position
Here is some code detailing the process, based on your example:
from sklearn.preprocessing import Normalizer, StandardScaler
import numpy as np
data = np.array([0,1,2,np.nan, 3,4])
# set valid mask
nan_mask = np.isnan(data)
valid_mask = ~nan_mask
normalizer = Normalizer(norm='l2')
# create a result array
result = np.full(data.shape, np.nan)
# assign the normalized values only to the valid positions
result[valid_mask] = normalizer.fit_transform(data[valid_mask].reshape(-1,1)).reshape(data[valid_mask].shape)
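Note that with data.reshape(-1,1) every row holds a single value, so Normalizer maps each nonzero entry to 1 or -1. For 2-D data where each row is a sample with several features, one possible alternative is to compute the row norms with np.nansum, which skips NaNs. The helper name nan_normalize below is hypothetical and not part of scikit-learn.

```python
import numpy as np

def nan_normalize(X, norm="l2"):
    """Rescale each row to unit norm, ignoring NaNs (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    if norm == "l1":
        norms = np.nansum(np.abs(X), axis=1, keepdims=True)
    elif norm == "l2":
        norms = np.sqrt(np.nansum(X ** 2, axis=1, keepdims=True))
    else:
        raise ValueError("norm must be 'l1' or 'l2'")
    # Rows whose norm is zero are left unchanged, as Normalizer does
    norms[norms == 0] = 1.0
    # NaN entries stay NaN because NaN / number is NaN
    return X / norms

X = np.array([[3.0, np.nan, 4.0],
              [0.0, 0.0, 0.0]])
result_l2 = nan_normalize(X, norm="l2")
result_l1 = nan_normalize(X, norm="l1")
print(result_l2)
print(result_l1)
```

This keeps the NaN positions intact and normalizes the remaining entries of each row as if the missing ones were absent.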
I'm working with Azure ML and my goal is to see what happens if a fixed percentage of the values in my dataset is missing.
My idea could be:
Starting from a dataset (take the Adult dataset as an example), duplicate the original dataset and call the copy X. Dataset X will have 20% of its values randomly set to missing. Once we have the original dataset and the duplicate X, we can pick a neural net algorithm, create training and test sets, and train the network with X as input. It would be interesting to see the overall error produced. After that, we can increase the share of missing values in X: start with 20%, then 40%, and so on. I think the hardest part is duplicating the original dataset and creating X with these missing values.
How can I do this? Using modules in Azure ML, or maybe R/Python scripts?
Just sharing my idea; please see the sample code and comments below.
import numpy as np
import pandas as pd
# Origin DataFrame
df = pd.DataFrame(np.random.randn(6,4))
# Copy data via flatten data matrix as an array
array = df.values.flatten()
# insert missing data by percent
# Define the percent of missing data
percent = 0.2
size = len(array)
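# generate random, non-repeating positions which will be assigned NaN
# (replace=False matters: the default samples with replacement and may pick a position twice)
chosen = np.random.choice(size, int(size*percent), replace=False)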
array[chosen] = np.nan
# Create a new DataFrame with missing data
df2 = pd.DataFrame(np.reshape(array, (6,4)))
Hope it helps.
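To repeat the experiment for several missing-value percentages, the same idea can be wrapped in a function. This is only a sketch: the name add_missing and the seed argument are hypothetical, and it assumes all columns are numeric.

```python
import numpy as np
import pandas as pd

def add_missing(df, frac, seed=None):
    """Return a copy of df with a fraction `frac` of its cells set to NaN."""
    rng = np.random.default_rng(seed)
    # Work on a flattened float copy so the original DataFrame is untouched
    values = df.to_numpy(dtype=float, copy=True).ravel()
    n_missing = int(round(values.size * frac))
    # Sample positions without replacement so exactly n_missing cells are hit
    chosen = rng.choice(values.size, size=n_missing, replace=False)
    values[chosen] = np.nan
    return pd.DataFrame(values.reshape(df.shape), index=df.index, columns=df.columns)

df = pd.DataFrame(np.random.default_rng(0).normal(size=(10, 5)))
for frac in (0.2, 0.4):
    df_missing = add_missing(df, frac, seed=1)
    print(frac, int(df_missing.isna().sum().sum()))
```

Each df_missing can then be fed to the same training pipeline so that the errors for 20%, 40% and so on are comparable.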
I've been trying to get a prediction for future values in a model I've created. I have tried both OLS in pandas and statsmodels. Here is what I have in statsmodels:
import statsmodels.api as sm
endog = pd.DataFrame(dframe['monthly_data_smoothed8'])
smresults = sm.OLS(dframe['monthly_data_smoothed8'], dframe['date_delta']).fit()
sm_pred = smresults.predict(endog)
sm_pred
The length of the array returned is equal to the number of records in my original dataframe but the values are not the same. When I do the following using pandas I get no values returned.
from pandas.stats.api import ols
res1 = ols(y=dframe['monthly_data_smoothed8'], x=dframe['date_delta'])
res1.predict
(Note that there is no .fit function for OLS in pandas.) Could somebody shed some light on how I might get future predictions from my OLS model in either pandas or statsmodels? I realize I must not be using .predict properly, and I've read about the problems other people have had, but they do not seem to apply to my case.
Edit: I believe 'endog' as defined is incorrect. I should be passing the values for which I want to predict, so I've created a date range of 12 periods past the last recorded value. But I am still missing something, as I am getting the error:
matrices are not aligned
Edit: here is a snippet of the data. The last column is the date delta, which is the difference in months from the first date:
month monthly_data monthly_data_smoothed5 monthly_data_smoothed8 monthly_data_smoothed12 monthly_data_smoothed3 date_delta
0 2011-01-31 3.711838e+11 3.711838e+11 3.711838e+11 3.711838e+11 3.711838e+11 0.000000
1 2011-02-28 3.776706e+11 3.750759e+11 3.748327e+11 3.746975e+11 3.755084e+11 0.919937
2 2011-03-31 4.547079e+11 4.127964e+11 4.083554e+11 4.059256e+11 4.207653e+11 1.938438
3 2011-04-30 4.688370e+11 4.360748e+11 4.295531e+11 4.257843e+11 4.464035e+11 2.924085
I think your issue here is that statsmodels doesn't add an intercept by default, so your model doesn't achieve much of a fit. To solve it in your code would be something like this:
dframe = pd.read_clipboard() # your sample data
dframe['intercept'] = 1
X = dframe[['intercept', 'date_delta']]
y = dframe['monthly_data_smoothed8']
smresults = sm.OLS(y, X).fit()
dframe['pred'] = smresults.predict()
Also, for what it's worth, I think the statsmodels formula API is much nicer to work with when dealing with DataFrames, and it adds an intercept by default (add a - 1 to the formula to remove it). See below; it should give the same answer.
import statsmodels.formula.api as smf
smresults = smf.ols('monthly_data_smoothed8 ~ date_delta', dframe).fit()
dframe['pred'] = smresults.predict()
Edit:
To predict future values, just pass new data to .predict(). For example, using the first model:
In [165]: smresults.predict(pd.DataFrame({'intercept': 1,
'date_delta': [0.5, 0.75, 1.0]}))
Out[165]: array([ 2.03927604e+11, 2.95182280e+11, 3.86436955e+11])
On the intercept: there's nothing encoded in the number 1; it's just based on the math of OLS (an intercept is perfectly analogous to a regressor that always equals 1), so you can pull the value right off the summary. Looking at the statsmodels docs, an alternative way to add an intercept would be:
X = sm.add_constant(X)
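To illustrate the remark that an intercept is just a regressor that always equals 1, here is a small sketch that uses NumPy's least-squares solver instead of statsmodels, on made-up data that follows y = 2 + 3x exactly. The variable names are illustrative only.

```python
import numpy as np

# Made-up data on an exact line: y = 2 + 3x
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 + 3.0 * x

# Design matrix with a column of ones for the intercept
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)  # intercept and slope

# Predicting future values: build the same two columns for the new x values
x_new = np.array([4.0, 5.0])
X_new = np.column_stack([np.ones_like(x_new), x_new])
y_pred = X_new @ beta
print(y_pred)

# Without the column of ones the line is forced through the origin
slope_only, *_ = np.linalg.lstsq(x[:, None], y, rcond=None)
print(slope_only)
```

The new data passed for prediction must have the same columns, in the same order, as the design matrix used for fitting; a mismatch there is a common cause of alignment errors.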