Working on an external dataset - python

I'm quite new to scikit-learn and was going through some of the examples of learning and predicting the samples in the iris dataset. But how do I load an external dataset for this purpose?
I downloaded a dataset that has data in the following form:
id attr1 attr2 .... label
123 0 0 ..... abc
234 0 0 ..... dsf
....
....
So how should I load this dataset in order to train a model and make predictions? Thanks.

One option is to use pandas. Assuming the data is space separated:
import pandas as pd
X = pd.read_csv('data.txt', sep=' ').values
where read_csv returns a DataFrame, and the values attribute returns a numpy array containing the data. You might want to separate out the last column of the above X as the labels, say into a one dimensional array y:
X, y = X[:, :-1], X[:, -1]
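As a rough end-to-end sketch of the steps above: the snippet below reads a space-separated table, drops the id column (it is an identifier, not a feature), splits off the label column and fits a classifier. The inline rows, the column names and the choice of DecisionTreeClassifier are assumptions made for illustration only; with a real file you would pass its path to read_csv instead of the StringIO object.

```python
import io

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical stand-in for the downloaded file (made-up rows).
raw = io.StringIO(
    "id attr1 attr2 label\n"
    "123 0 0 abc\n"
    "234 0 1 dsf\n"
    "345 1 0 abc\n"
    "456 1 1 dsf\n"
)

df = pd.read_csv(raw, sep=" ")

# Features: everything except the identifier and the label.
X = df.drop(columns=["id", "label"]).to_numpy()
# Target: the label column as a one-dimensional array.
y = df["label"].to_numpy()

# Any scikit-learn estimator follows the same fit/predict pattern.
clf = DecisionTreeClassifier(random_state=0).fit(X, y)
pred = clf.predict(X)
```

Selecting columns by name before calling to_numpy keeps the feature array purely numeric, whereas calling .values on the whole frame mixes numbers and label strings in one object array.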

Related

Feature selection using mixed data types

I am trying to create some code that gives weight to the most impactful features.
My dataframe contains both numerical and categorical data.
example data:
[Brand] [Model] [Car_price] [...] [Prime]
BMW X1 40,000 300
The Y is the prime and X is all other columns.
I tried using the following:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
df = pd.read_csv(data, delimiter=";")
#df = df.dropna(axis=1)
array = df.values
X = array[:,(6,7,9,12,13,14,15,16,17,18,19,20,21,22,23,24,25,27,34,35,37,44,45,47,48,54,61,62)]
Y = array[:,51]
forest = RandomForestClassifier(n_jobs=-1, class_weight='balanced', max_depth=5)
forest.fit(X, Y)
And get the following error: ValueError: could not convert string to float
I know there is a way to transform from string into numerical data, but was wondering if it is necessary. What fixes can I apply to get weighted features?
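For what it's worth, scikit-learn's random forests only accept numeric input, so the string columns do have to be encoded first. A minimal sketch, assuming one-hot encoding with pandas.get_dummies is acceptable for your data; the rows below are made up, and only the column names Brand, Model, Car_price and Prime come from the question:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Made-up rows that mimic the layout in the question.
df = pd.DataFrame({
    "Brand": ["BMW", "Audi", "BMW", "Fiat", "Audi", "Fiat"],
    "Model": ["X1", "A3", "X3", "500", "A4", "Panda"],
    "Car_price": [40000, 35000, 55000, 15000, 45000, 12000],
    "Prime": [300, 300, 450, 150, 450, 150],
})

# One-hot encode the string columns; numeric columns pass through unchanged.
X = pd.get_dummies(df.drop(columns="Prime"), columns=["Brand", "Model"], dtype=float)
y = df["Prime"]

forest = RandomForestClassifier(
    n_estimators=100, class_weight="balanced", max_depth=5, random_state=0
)
forest.fit(X, y)

# One importance value per encoded column, highest first.
importances = pd.Series(forest.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

Note that one-hot encoding spreads the importance of a categorical column over its dummy columns, so you may want to sum them per original feature. If Prime is a continuous amount rather than a class, RandomForestRegressor would probably be the better fit.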

scikit preprocessing across entire dataframe

I have a dataframe:
df = pd.DataFrame({'Company': ['abc', 'xyz', 'def'],
'Q1-2019': [9.05, 8.64, 6.3],
'Q2-2019': [8.94, 8.56, 7.09],
'Q3-2019': [8.86, 8.45, 7.09],
'Q4-2019': [8.34, 8.61, 7.25]})
The data is an average response of the same question asked across 4 quarters.
I am trying to create a benchmark index from this data. To do so I wanted to preprocess it first using either standardize or normalize.
How would I standardize/normalize across the entire dataframe. What is the best way to go about this?
I can do this for a single row or column using the code below, but I am struggling to do it across the whole dataframe.
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
#define scaler
scaler = MinMaxScaler() #or StandardScaler
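X = df.loc[1, df.columns[1:]] # one row, numeric columns only (skip Company)
X = X.to_numpy(dtype=float).reshape(-1, 1) # scalers expect a 2D array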
#transform data
scaled = scaler.fit_transform(X)
If I understood your need correctly, you can use ColumnTransformer to apply the same transformation (e.g. scaling) separately to each of the selected columns.
As the ColumnTransformer documentation explains, each transformer is specified as a tuple containing:
a name for the step
the chosen transformer (e.g. StandardScaler), or a Pipeline
a list of the columns to which the transformation is applied
Code example
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
# specify columns
columns = ['Q1-2019', 'Q2-2019', 'Q3-2019', 'Q4-2019']
# create a ColumnTransformer instance
ct = ColumnTransformer([
    ('scaler', StandardScaler(), columns)
])
# fit and transform the input dataframe
ct.fit_transform(df)
array([[ 0.86955718, 0.93177476, 0.96056682, 0.46493449],
[ 0.53109031, 0.45544147, 0.41859563, 0.92419906],
[-1.40064749, -1.38721623, -1.37916245, -1.38913355]])
ColumnTransformer outputs a numpy array with the transformed values, computed with the scaling parameters fitted on the input dataframe df. Even though the column names are gone, the array columns are in the same order as the listed columns, so it's easy to convert the array back to a pandas dataframe if you need to.
In addition to @RicS's answer, note that scikit-learn functions return a numpy array, not a dataframe. Also, the Company column is not included. You can convert the results back to a dataframe like this:
scaler = StandardScaler()
x = scaler.fit_transform(df.drop("Company",axis=1)) # scale all columns except Company
y = pd.concat([df["Company"],pd.DataFrame(x, columns=df.columns[1:])],axis=1) # adds results and company into dataframe again
y.head()
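Both answers above scale each quarter column on its own. If "across the entire dataframe" instead means a single mean and standard deviation (or a single min and max) shared by all cells, that can be done with plain pandas/numpy arithmetic. A small sketch under that assumption, using the dataframe from the question:

```python
import pandas as pd

df = pd.DataFrame({'Company': ['abc', 'xyz', 'def'],
                   'Q1-2019': [9.05, 8.64, 6.3],
                   'Q2-2019': [8.94, 8.56, 7.09],
                   'Q3-2019': [8.86, 8.45, 7.09],
                   'Q4-2019': [8.34, 8.61, 7.25]})

numeric = df.drop(columns="Company")
values = numeric.to_numpy()

# Global standardization: one mean and one std for all 12 cells.
standardized = (numeric - values.mean()) / values.std()

# Global min-max normalization: one min and one max for all cells.
normalized = (numeric - values.min()) / (values.max() - values.min())

# Put the Company column back next to the scaled values.
result = pd.concat([df["Company"], standardized], axis=1)
print(result)
```

Because all quarters answer the same question on the same scale, a shared mean and std keeps the quarters comparable with each other, which per-column scaling does not.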

Format and slice arrays with Python - prepare data for linear regression

I know this is quite a basic question, but I am struggling a bit to format a tuple properly.
I have a csv file whose head is:
id x1 x2 x3 y1 y2
1 23 45 31 2 5
2 34 5 21 3 12
3 234 4 26 4 20
....
I am building a multi-target linear regression model (I will use MultiOutputRegressor from scikit-learn), so I want to split the data into X (which will then be split into a training set and a test set) and a target Y. I import the csv like this:
import csv
with open('data.csv', newline='') as csvfile:
    data = list(csv.reader(csvfile))
so I get a list of lists (one per row, with every value read as a string). But how do I access the elements? My X set would be all the values of the fields x1, x2, x3 (then I would select some rows of X to build Xtrain); my Y set would be all the values of y1, y2.
My final goal is something like:
X= [[23 45 31]
[34 5 21]
[234 4 26]
...]
Y=[[2,5]
[3,12]
[4,20]
...]
How can I achieve this?
Alternatively: how can I group the data structured as I said in a sparse matrix, which is a valid argument for scikit learn's linear regression function?
You can manipulate arrays with numpy:
import numpy as np
data = np.array(data) # Transform list to numpy array
data = data[1:].astype(float) # Drop the header (first line) and convert the strings to numbers
y_col_index = 4 # Column order: id, x1, x2, x3, y1, y2
X = data[:, 1:y_col_index] # Select x1, x2, x3 (skip the id column)
Y = data[:, y_col_index:] # Select y1, y2
Victor Daplasse's answer is probably more straightforward, but I always prefer to use pandas to read and pre-process csv files.
import pandas as pd
import numpy as np
data = pd.read_csv('data.csv')
X = np.array(data[['x1', 'x2', 'x3']])
Y = np.array(data[['y1', 'y2']])
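To connect this with the stated goal, here is a minimal sketch of the remaining steps: splitting into training and test sets and fitting a MultiOutputRegressor. It assumes a comma-separated file; the first three rows come from the question, the other three are made up so that there is something to split, and the split size is arbitrary.

```python
import io

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputRegressor

# Stand-in for data.csv; rows 4 to 6 are invented for illustration.
csv_text = io.StringIO(
    "id,x1,x2,x3,y1,y2\n"
    "1,23,45,31,2,5\n"
    "2,34,5,21,3,12\n"
    "3,234,4,26,4,20\n"
    "4,50,10,30,5,25\n"
    "5,60,20,22,6,31\n"
    "6,70,30,28,7,36\n"
)
data = pd.read_csv(csv_text)

X = data[["x1", "x2", "x3"]].to_numpy()
Y = data[["y1", "y2"]].to_numpy()

# Hold out 2 of the 6 rows for testing.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=2, random_state=0
)

# One LinearRegression is fitted per target column.
model = MultiOutputRegressor(LinearRegression()).fit(X_train, Y_train)
pred = model.predict(X_test)
print(pred.shape)
```

A dense numpy array like this is a valid input for the regressor, so there should be no need to build a sparse matrix for data of this kind.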

Put a fixed quantity of missing values in a dataset - Azure ML

I'm working with Azure ML, and my goal is to see what happens when my dataset contains a fixed percentage of missing values.
My idea is the following:
Starting from a dataset (for example the Adult dataset), duplicate the original and call the copy X. Dataset X will have 20% of its values replaced by missing values at random. Once we have the original dataset and the copy X, we can create training and test sets and train a neural network with X as input. The interesting thing to look at is the overall error it produces. After that, we can increase the share of missing values in X: starting from 20%, then 40%, and so on. I think the hardest part is duplicating the original dataset and creating the dataset X with these missing values.
How can I do this? With modules in Azure ML, or with R/Python scripts?
Just sharing my idea; please see the sample code and comments below.
import numpy as np
import pandas as pd
# Origin DataFrame
df = pd.DataFrame(np.random.randn(6,4))
# Flatten the data matrix into a 1-D array (flatten() returns a copy, so df is untouched)
array = df.values.flatten()
# insert missing data by percent
# Define the percent of missing data
percent = 0.2
size = len(array)
# pick distinct random positions that will be assigned NaN
# (replace=False avoids duplicates, so exactly int(size*percent) values go missing)
chosen = np.random.choice(size, int(size*percent), replace=False)
array[chosen] = np.nan
# Create a new DataFrame with missing data
df2 = pd.DataFrame(np.reshape(array, (6,4)))
Hope it helps.
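The same idea can be wrapped in a small helper so the experiment can be repeated for 20%, 40% and so on. This is only a sketch: add_missing is a hypothetical name, and the random 10x5 frame stands in for the real dataset.

```python
import numpy as np
import pandas as pd


def add_missing(df, fraction, seed=None):
    """Return a copy of df with the given fraction of cells set to NaN."""
    rng = np.random.default_rng(seed)
    n_cells = df.size
    n_missing = int(round(n_cells * fraction))
    # Boolean mask with exactly n_missing True entries at distinct positions.
    mask = np.zeros(n_cells, dtype=bool)
    mask[rng.choice(n_cells, size=n_missing, replace=False)] = True
    # DataFrame.mask replaces the cells where the mask is True with NaN.
    return df.mask(mask.reshape(df.shape))


# Stand-in for the original dataset (random numbers, no missing values).
original = pd.DataFrame(np.random.default_rng(0).normal(size=(10, 5)))

for fraction in (0.2, 0.4):
    X = add_missing(original, fraction, seed=1)
    print(fraction, int(X.isna().sum().sum()))
```

The original frame is left unchanged, so every missing-value level starts from the same clean data. In Azure ML this could presumably run inside a Python script module, though I have not checked that setup.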

Will pandas dataframe object work with sklearn kmeans clustering?

dataset is a pandas dataframe, and KMeans is sklearn.cluster.KMeans:
km = KMeans(n_clusters = n_Clusters)
km.fit(dataset)
prediction = km.predict(dataset)
This is how I decide which entity belongs to which cluster:
for i in range(len(prediction)):
    cluster_fit_dict[dataset.index[i]] = prediction[i]
This is how dataset looks:
A 1 2 3 4 5 6
B 2 3 4 5 6 7
C 1 4 2 7 8 1
...
where A,B,C are indices
Is this the correct way of using k-means?
Assuming all the values in the dataframe are numeric,
import pandas
import sklearn.cluster
# Convert DataFrame to a numpy array
mat = dataset.values
# Using sklearn
km = sklearn.cluster.KMeans(n_clusters=5)
km.fit(mat)
# Get cluster assignment labels
labels = km.labels_
# Format results as a DataFrame (one row per sample: index, cluster label)
results = pandas.DataFrame([dataset.index,labels]).T
Alternatively, you could try KMeans++ for Pandas.
To know if your dataframe dataset has suitable content you can explicitly convert to a numpy array:
dataset_array = dataset.values
print(dataset_array.dtype)
print(dataset_array)
If the array has a homogeneous numerical dtype (typically numpy.float64) then it should be fine for scikit-learn 0.15.2 and later. You might still need to normalize the data, for instance with sklearn.preprocessing.StandardScaler.
If your data frame is heterogeneously typed, the dtype of the corresponding numpy array will be object which is not suitable for scikit-learn. You need to extract a numerical representation for all the relevant features (for instance by extracting dummy variables for categorical features) and drop the columns that are not suitable features (e.g. sample identifiers).
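A small sketch of that preparation, assuming a frame with two numeric features and one categorical column; the data and the column names f1, f2 and kind are made up for illustration:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Made-up heterogeneous frame, indexed by entity name.
dataset = pd.DataFrame(
    {
        "f1": [1, 2, 1, 9, 8, 9],
        "f2": [2, 3, 4, 9, 8, 7],
        "kind": ["a", "a", "b", "b", "a", "b"],
    },
    index=list("ABCDEF"),
)

# Replace the categorical column with dummy (one-hot) variables.
features = pd.get_dummies(dataset, columns=["kind"], dtype=float)

# Put all features on a comparable scale before clustering.
scaled = StandardScaler().fit_transform(features)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(scaled)

# Map every index label to its cluster without an explicit loop.
clusters = pd.Series(km.labels_, index=dataset.index, name="cluster")
print(clusters)
```

Building a Series from km.labels_ with the original index gives the same entity-to-cluster mapping as the loop in the question.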
