I'm working with Azure ML and my goal is to see what happens if a fixed percentage of values in my dataset is missing.
My idea is the following:
Starting from a dataset (take the Adult dataset as an example), duplicate the original dataset and call the copy X by convention. Dataset X will contain randomly placed missing values amounting to 20% of the data. Once we have the original dataset and the duplicated dataset X, we can take a neural-network algorithm, create training and test sets, and train the network with dataset X as input. What would be interesting to see is the global error produced. Afterwards we can expand the range of missing values in dataset X: starting from 20%, then 40%, and so on. I think the hardest part is duplicating the original dataset, i.e. creating dataset X with these missing values.
How can I do this? Using modules in Azure ML, or maybe R/Python scripts?
Just sharing my idea; please see the sample code and comments below.
import numpy as np
import pandas as pd
# Original DataFrame
df = pd.DataFrame(np.random.randn(6,4))
# Copy the data by flattening the data matrix into a 1-D array
array = df.values.flatten()
# Define the percentage of missing data to insert
percent = 0.2
size = len(array)
# randomly pick (without replacement) the indices that will be set to NaN
chosen = np.random.choice(size, int(size * percent), replace=False)
array[chosen] = np.nan
# Create a new DataFrame with missing data
df2 = pd.DataFrame(np.reshape(array, (6,4)))
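To then expand the range of missing values (20%, 40%, and so on) as described in the question, the same steps can be wrapped in a small helper. This is only a sketch; add_missing is a hypothetical name, and it assumes the DataFrame holds numeric data so NaN can be assigned:
import numpy as np
import pandas as pd

def add_missing(df, percent, seed=None):
    # return a copy of df with `percent` of its cells replaced by NaN
    rng = np.random.default_rng(seed)
    arr = df.values.astype(float).flatten()
    chosen = rng.choice(arr.size, int(arr.size * percent), replace=False)
    arr[chosen] = np.nan
    return pd.DataFrame(arr.reshape(df.shape), index=df.index, columns=df.columns)

df = pd.DataFrame(np.random.randn(6, 4))
# one corrupted copy per missing-value level
corrupted = {p: add_missing(df, p, seed=0) for p in (0.2, 0.4, 0.6)}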
Hope it helps.
Here are the requirements:
Partition the data set into a train data set and a test data set.
Systematic sampling should be used when partitioning the data.
The train data set should be about 80% of all data points and the test data set should be the remaining 20%.
I have tried some code:
def systematic_sampling(df, step):
    indexes = np.arange(0, len(df), step=step)
    systematic_sample = df.iloc[indexes]
    return systematic_sample
and
from sklearn.model_selection import train_test_split
df_train, df_test = train_test_split(df, test_size=0.2)
These snippets either do systematic sampling or partition the data, but I'm not sure how to satisfy both conditions at the same time.
Systematic sampling:
It is a sampling technique in which the first element is selected at random and the others are selected at a fixed sampling interval. For instance, consider a population of size 20 (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20).
Suppose the first selected element is number 3 and the sample size is 5. The sampling interval is 20/5 = 4, so the next selection is 3 + 4 = 7, giving 3, 7, 11, and so on.
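As a quick illustration of that example (a sketch only; Python indexing is 0-based, so element number 3 sits at index 2):
import numpy as np

population = np.arange(1, 21)        # the population 1 .. 20
start, step = 2, 4                   # start at element number 3, interval 20/5 = 4
sample = population[start::step]     # -> array([ 3,  7, 11, 15, 19])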
So, you want to do this and also partition your data into two separate sets, one with 80% and the other with 20% of the original data.
You may use the following:
import numpy as np
import pandas as pd

def systematic_sampling(df, step):
    # take every `step`-th row, starting from the first one
    indexes = np.arange(0, len(df), step=step)
    systematic_sample = df.iloc[indexes]
    return systematic_sample

# A 20% test set means picking every 5th row: step = 1 / 0.2 = 5
testSize = 0.2
step = int(round(1 / testSize))
test_df = systematic_sampling(df, step)
# First, concat both data frames, so the output will have some duplicates!
remaining_df = pd.concat([df, test_df])
# Then, drop the duplicated rows; it is like "df - test_df"
# (this assumes df itself contains no duplicate rows)
remaining_df = remaining_df.drop_duplicates(keep=False)
Now, in remaining_df you have 80% of the original data to use as the training set, and in test_df you have the 20% test data.
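As a quick sanity check of the split sizes (a sketch with a hypothetical 100-row frame, reusing the function and imports above):
df = pd.DataFrame({'x': range(100)})
test_df = systematic_sampling(df, 5)                               # every 5th row -> 20 rows
remaining_df = pd.concat([df, test_df]).drop_duplicates(keep=False)
print(len(remaining_df), len(test_df))                             # 80 20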
For others reading this, here is a good reference on this question: Read Me!
I have a data frame with input and output columns that have a linear relation. I want to remove the data points that do not fit this relation. My actual df is big and has many samples; here I am giving an example.
My code:
import pandas as pd

xdf = pd.DataFrame({'ip': [10, 20, 30, 40], 'op': [105, 195, 500, 410]})
I have no idea how to proceed.
You can do a linear fit first, then filter out the data that falls outside a certain threshold.
Sample code below:
import numpy as np
import pandas as pd

df = pd.DataFrame({'ip': [10, 20, 30, 40], 'op': [105, 195, 500, 410]})
# do a linear fit on ip and op
f = np.polyfit(df.ip, df.op, 1)
fl = np.poly1d(f)
# you will have to determine this threshold in some way
threshold = 100
# keep only the rows whose residual from the fitted line is below the threshold
output = df[(df.op - fl(df.ip)).abs() < threshold]
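The right threshold is data-dependent; one common, more robust option (my assumption, not something from the original answer) is to base it on the median absolute deviation of the residuals rather than a hard-coded number:
# hypothetical alternative: flag points whose residual deviates too much from the typical residual
residuals = df.op - fl(df.ip)
mad = (residuals - residuals.median()).abs().median()
threshold = 3 * 1.4826 * mad          # roughly three "robust standard deviations"
output = df[(residuals - residuals.median()).abs() < threshold]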
Another way:
You can create a boolean mask that checks whether the ratio op/ip is less than its mean value:
m = xdf.eval("op/ip").lt(xdf.eval("op/ip").mean())
Finally:
import matplotlib.pyplot as plt

out = xdf[m]
plt.scatter(x=out['ip'], y=out['op'])
plt.show()
I'm trying to normalize data with missing (i.e. nan) values before processing it, using scikit-learn preprocessing.
Apparently, some scalers (e.g. StandardScaler) handle the missing values the way I want, by which I mean they normalize the existing values while keeping the nans, while others (e.g. Normalizer) just raise an error.
I've looked around and haven't found an answer: how can I use the Normalizer with missing values, or replicate its behavior (with norm='l1' and norm='l2'; I need to test several normalization options) some other way?
from sklearn.preprocessing import Normalizer, StandardScaler
import numpy as np
data = np.array([0,1,2,np.nan, 3,4])
scaler = StandardScaler(with_mean=True, with_std=True)
scaler.fit_transform(data.reshape(-1,1))
normalizer = Normalizer(norm='l2')
normalizer.fit_transform(data.reshape(-1,1))
The problem with your request is that Normalizer operates in this fashion, according to the documentation:
Normalize samples individually to unit norm.
Each sample (i.e. each row of the data matrix) with at least one non zero component is rescaled independently of other samples so that its norm (l1 or l2) equals one (source here).
That means that each row has to be rescaled to unit norm. How do you deal with a missing value? Ideally you don't want it to count towards the norm and you want the row to be normalized regardless of it, but the internal function check_array prevents this by throwing an error.
You need to circumvent this situation. The most reasonable way to do it is to:
first create a mask in order to record which elements were missing in your array
create a result array filled with missing values
apply the Normalizer to your array after selecting only the valid entries
write the normalized values back into the result array at their original positions
Here is some code detailing the process, based on your example:
from sklearn.preprocessing import Normalizer, StandardScaler
import numpy as np
data = np.array([0,1,2,np.nan, 3,4])
# record which entries are missing and which are valid
nan_mask = np.isnan(data)
valid_mask = ~nan_mask
normalizer = Normalizer(norm='l2')
# create a result array filled with NaN
result = np.full(data.shape, np.nan)
# normalize only the valid entries, then write them back at their original positions
valid = data[valid_mask].reshape(-1, 1)
result[valid_mask] = normalizer.fit_transform(valid).ravel()
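Since the question mentions needing both norm='l1' and norm='l2', the same masking trick can be reused for each option; a minimal sketch, reusing data and valid_mask from above:
# loop over the norms mentioned in the question
for norm in ('l1', 'l2'):
    res = np.full(data.shape, np.nan)
    valid = data[valid_mask].reshape(-1, 1)
    res[valid_mask] = Normalizer(norm=norm).fit_transform(valid).ravel()
    print(norm, res)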
I have a dataframe which I have converted to an array in order to model the data with a regression algorithm. I used the following code to do it:
X = df.iloc[:, 0:345].values
Y = df.iloc[:, 345].values
Hence X and Y are now arrays. There are many columns because the categorical variables have been converted into dummy variables. Next, I create the train and test split:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.25, random_state=0)
Now, after I have finished building the model and making predictions, I want to get back the values of my categorical variables (X and Y were created after making dummy variables for all categorical variables). For this, I am trying to convert my X_test back to a dataframe with the column names of the original dataframe df. I tried the following code:
dff=df.iloc[:, 0:345]
The above statement is to get the first 345 columns (of the data frame).
Then,
pd.DataFrame(X_test, index=dff.index, columns=dff.columns)
I get the following error
ValueError: Shape of passed values is (345, 25000), indices imply (345, 100000)
I don't understand why the number of rows matters. I have fewer rows because my data has been split 75%-25% into train and test, and I performed the split after the data was converted to an array. How do I now convert the array data into a dataframe with the column names from the dff dataframe?
pd.DataFrame(X_test, index=dff.index, columns=dff.columns)
failed because X_test is a numpy.ndarray with only a quarter of the rows, so the full index of dff no longer fits. I modified the statement to just this:
df_new = pd.DataFrame(X_test)
df_new.columns = list(dff.columns)
The new dataframe contains the X_test data, and the column names from the dff dataframe are assigned to it as well.
I would recommend using the DataFrame for train_test_split, and then passing in arrays to your algorithm using numpy:
my_algorithm(np.asarray(X_train), np.asarray(y_train))
This way you can look at your data the same way you would for any DataFrame, but still run the model on the array. I'm not sure what library you are using, but I'm pretty sure some can take DataFrames for modeling now.
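A minimal sketch of that recommendation, assuming the df and the 75/25 split from the question:
from sklearn.model_selection import train_test_split
import numpy as np

X = df.iloc[:, 0:345]      # keep X as a DataFrame instead of calling .values
y = df.iloc[:, 345]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# X_test keeps the original column names and row index, so no reconstruction is needed
my_algorithm(np.asarray(X_train), np.asarray(y_train))   # my_algorithm is the placeholder from above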
I am working on predicting the probability of a response (binary: yes or no, i.e. 1 or 0) for 60,000 dispute claims, each with its unique reference ID. Using the first 3/4 of the data as the training set (X_train, y_train) and logistic regression as the classifier, I predict the response probability for the last 1/4 as the test set (X_test). I would like to turn the output into an indexed series, such that it looks like:
reference_id
184932 0.531842
185362 0.401958
185361 0.105928
185338 0.018572
...
276499 0.208567
276500 0.818759
269851 0.018528
Name: response, dtype: float32
I implemented the following Python code:
from sklearn.linear_model import LogisticRegression

y_score_lr = LogisticRegression(C=10).fit(X_train, y_train).predict_proba(X_test)[:, 1]
y_proba = y_score_lr
The result is a numpy array like this:
array([ 0.05225495, 0.00522493, 0.07369773, ..., 0.06994582, 0.06995239, 0.12659022])
But I am not sure if this array actually matches the corresponding reference_id in the original X_test data frame, and I haven't figured out how to convert it into an indexed "series" like the one I mentioned at the beginning of this post.
It would be very much appreciated if someone could point me to a helpful shortcut to achieve this.
I also tried using
y_score_lr = LogisticRegression(C=10).fit(X_train, y_train).predict_proba(X_test)[:,1]
y_proba = y_score_lr.tolist()
to convert the array into a list, but I still could not turn it into the desired series-type output indexed by 'reference_id'.
Thank you.
First of all, yes, it matches the values of X_test: the first row corresponds to the first value in the y_proba array.
Secondly, there are several ways you can approach this problem.
One possible solution may be the following, assuming you want the output to be a pandas.Series:
import pandas as pd
import numpy as np

y_proba_indexed = pd.Series(data=y_proba, index=X_test['reference_id'],
                            name='response', dtype=np.float32)
print(y_proba_indexed)
This would give you something like this:
184932 0.531842
185362 0.401958
185361 0.105928
185338 0.018572
....
276499 0.208567
276500 0.818759
269851 0.018528
Name: response, dtype: float32
To access, for instance, the probability referring to reference_id = 185338, you may type y_proba_indexed.loc[[185338]]; the output will be:
185338 0.018572
Name: response, dtype: float32
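If reference_id happens to be the index of X_test rather than a column (the post does not say which, so this is an assumption), the same idea works with the frame's index directly:
# hypothetical variant: reference_id is already the index of X_test
y_proba_indexed = pd.Series(data=y_proba, index=X_test.index,
                            name='response', dtype=np.float32)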