My dataset is destroyed after preprocessing data. What should I do? - python

My dataset is destroyed after preprocessing the data. I was finding the unique values and creating a data frame with them, but it doesn't show me the names of my columns. Why?
import numpy as np
import pandas as pd

datasetnew1 = []
for j in range(X.shape[1]):              # loop over every column
    m = np.unique(X.iloc[:, j])          # unique values in column j
    if len(m) > 10:                      # keep columns with more than 10 unique values
        datasetnew1.extend(m)            # collect those unique values in one flat list
datasetnew1 = np.array(datasetnew1, dtype='object').transpose()
datasetnew1 = pd.DataFrame(datasetnew1)  # build the new dataset
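The column names disappear because pd.DataFrame(datasetnew1) builds a brand-new frame from a bare NumPy array, so pandas assigns default integer column names (0, 1, 2, ...). A minimal sketch of one way to keep the names, using a small hypothetical frame in place of your X:

import numpy as np
import pandas as pd

# hypothetical stand-in for X
X = pd.DataFrame({'a': range(20), 'b': [1, 2] * 10, 'c': range(20, 40)})

kept = {}
for col in X.columns:
    m = np.unique(X[col])
    if len(m) > 10:        # same threshold as above
        kept[col] = m
# wrap in Series so columns with different unique counts get NaN-padded
datasetnew1 = pd.DataFrame({col: pd.Series(vals) for col, vals in kept.items()})
print(datasetnew1.columns.tolist())  # ['a', 'c'] - the original names survive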

Related

Detect the two most frequent labels in the labels variable, the other records of the dataset will be eliminated

I am working with a dataset from which I need to remove some records based on one variable.
The dataset is from the sklearn library:
from sklearn.datasets import fetch_kddcup99
I need to detect the two most frequent labels in the labels variable; all records with other labels will be eliminated.
datos = pd_data.groupby('labels').size().sort_values(ascending=False)
top = datos.head(2)
print(top)
I tried to delete the other records this way, but it doesn't work: when I look at the dataset afterwards, they are still there, and I need only the rows with the two most frequent labels.
If I understand your question, you want to create a dataframe containing only those records with the two most frequent labels.
Assuming you have a list a of the desired labels, you can filter the dataframe as follows:
a = ["b'neptune,'", "b'normal,'"]
dfout = df[df['labels'].isin(a)]  # isin(a) gives a boolean mask; wrap it in df[...] to filter
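For completeness, a minimal end-to-end sketch, assuming the dataframe is built directly from the data and target attributes returned by fetch_kddcup99 (the actual label values are bytes objects whose exact spelling depends on the fetched data):

import pandas as pd
from sklearn.datasets import fetch_kddcup99

raw = fetch_kddcup99()
df = pd.DataFrame(raw.data)
df['labels'] = raw.target

# find the two most frequent labels, then keep only their rows
top = df.groupby('labels').size().sort_values(ascending=False).head(2)
dfout = df[df['labels'].isin(top.index)]
print(dfout['labels'].unique())  # only the two most frequent labels remain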

Iterate over pandas column and create new column

I have a pandas dataset with some number of batches (the batch sizes, i.e. row counts, differ), and I create a new feature for each batch using that batch's data.
I want to automate this process: first create a new column, then iterate over the batch id column, compute the new feature values for all rows sharing the current batch id, write them into the new column, and continue with the next batch.
Here is the code for the manual method for a single batch:
import numpy as np
from sklearn.neighbors import BallTree

batch = samples.loc[samples['batch id'] == 'XX']
tree = BallTree(red_points[['col1', 'col2']], leaf_size=15, metric='minkowski')
distance, index = tree.query(batch[['col1', 'col2']], k=2)
batch_size = batch.shape[0]
batch['new feature'] = distance[np.arange(batch_size), batch.col3]
Since your batches are identified by batch id, you can iterate over all the unique batch ids and write entries into the "new feature" column only for the rows of the batch currently being processed.
# first create an empty column
samples["new feature"] = np.nan
# iterate through all unique batch ids
for batch_id in samples["batch id"].unique():
    batch = samples.loc[samples["batch id"] == batch_id]
    # do your computations on `batch` here, then write the result back
    samples.loc[samples["batch id"] == batch_id, "new feature"] = computed_values  # placeholder for your computed values
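Putting the two pieces together, a sketch of the automated loop, assuming (as in the manual snippet) that the same red_points tree serves every batch and that col3 holds 0/1 indices selecting one of the k=2 query distances:

import numpy as np
from sklearn.neighbors import BallTree

# build the tree once; it does not depend on the batch
tree = BallTree(red_points[['col1', 'col2']], leaf_size=15, metric='minkowski')

samples['new feature'] = np.nan
for batch_id in samples['batch id'].unique():
    mask = samples['batch id'] == batch_id
    batch = samples.loc[mask]
    distance, index = tree.query(batch[['col1', 'col2']], k=2)
    # pick, for each row, the distance selected by that row's col3 value
    samples.loc[mask, 'new feature'] = distance[np.arange(len(batch)), batch['col3']]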

One hot encoding when reading into tf.dataset

I am running a TensorFlow model on the GCP AI Platform. The dataset is large and not everything can be kept in memory at the same time, so I read the data into a tf.data.Dataset using the following code:
import tensorflow as tf

def read_dataset(filepattern):
    def decode_csv(value_column):
        cols = tf.io.decode_csv(value_column, record_defaults=[[0.0], [0], [0.0]])
        features = [cols[1], cols[2]]
        label = cols[0]
        return features, label
    # create the list of files that match the pattern
    file_list = tf.io.gfile.glob(filepattern)
    # create a dataset from the file list
    dataset = tf.data.TextLineDataset(file_list).map(decode_csv)
    return dataset

training_data = read_dataset(<filepattern>)
The problem is that the second column in my data is categorical and I need to use one-hot encoding. How can this be done, either in the function decode_csv or by manipulating the tf.data.Dataset later?
You could use tf.one_hot. Assuming that the second column is cols[1] and that the categorical values have been converted to integers, you could do the following:
def decode_csv(value_column):
    cols = tf.io.decode_csv(value_column, record_defaults=[[0.0], [0], [0.0]])
    # cols[1] has the integer default, so it is the categorical column;
    # nb_classes is the total number of categories
    features = [tf.one_hot(cols[1], nb_classes), cols[2]]
    label = cols[0]
    return features, label
NOTE: Not tested.
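Alternatively, to cover the "manipulate the tf.dataset later" option from the question, a sketch (also untested) that leaves decode_csv unchanged and applies the encoding in a second map call, with nb_classes again assumed to be the known number of categories:

import tensorflow as tf

def one_hot_features(features, label):
    cat, num = features  # features arrives as [cols[1], cols[2]] from decode_csv
    return [tf.one_hot(cat, nb_classes), num], label

training_data = read_dataset(filepattern).map(one_hot_features)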

MultiLabelBinarizer mixes up data when inverse transforming

I am using sklearn's MultiLabelBinarizer() to binarize multiple columns of my data, which I then use to train my model.
After using it I noticed it was mixing up my data when inverse transforming it. I created a test set of random values where I fit the data, transform it, and inverse_transform it to get back to the original data.
I ran a simple test in a Jupyter notebook to show the error: in the inverse_transformed values, row 1 is wrong, with the state and month mixed up.
(The Jupyter notebook code was shown as a screenshot.)
First of all, is there an error in how I use the MultiLabelBinarizer? Is there a different way to achieve the same output?
EDIT:
Thank you to @Nicolas M. for helping me solve my question. I ended up solving this issue like this.
Forgive the rough explanation, but it turned out to be more complicated than I originally thought. I switched from the MultiLabelBinarizer to the LabelBinarizer because it let me fit one binarizer per column.
I ended up pickling the LabelBinarizer defaultdict so I can load it and use it in different modules of my machine learning project.
One thing that might not be trivial is that I add new headers to the dataframe I build for each column, in the form column_name + column_number. I did this because I needed to inverse transform the data: to do that, I search for the columns whose names contain the original column name, which splits the larger dataframe back into the individual per-column chunks.
Here are some variables that I used, and what they mean, for reference:
lb_dict - defaultdict that stores the different label binarizers.
binarize_df - dataframe that stores the binarized data.
binarized_label - the binarized values for one column.
header - the new headers, in the form column name + column number.
inverse_df - dataframe that stores the inverse_transformed data.
one_label_list - the list of column names containing the original column name.
one_label_df - dataframe holding only the binarized data for one column.
single_label - binarized data that gets inverse_transformed back into one column.
In this code, data is the dataframe that I pass to the function.
from collections import defaultdict
import pandas as pd
from sklearn.preprocessing import LabelBinarizer

lb_dict = defaultdict(LabelBinarizer)
# create a placeholder dataframe to join the new binarized data to
binarize_df = pd.DataFrame(['x'] * len(data.index), columns=['place_holder'])
# loop through each column, create a binarizer, fit/transform the data,
# and join the new data to the binarize_df dataframe
for column in data.columns.values.tolist():
    lb_dict[column].fit(data[column])
    binarized_label = lb_dict[column].transform(data[column])
    header = [column + str(i) for i in range(0, len(binarized_label[0]))]
    binarize_df = binarize_df.join(pd.DataFrame(binarized_label, columns=header))
# drop the placeholder column
binarize_df.drop(labels=['place_holder'], axis=1, inplace=True)
Here is the inverse_transform function that I wrote:
# label_binarizer holds the fitted binarizers (the lb_dict from above);
# output is the binarized dataframe, output_cols the original column names
inverse_df = pd.DataFrame(['x'] * len(output.index), columns=['place_holder'])
# loop over the output columns that need to be inverse_transformed
for column in output_cols:
    # list the headers whose names contain the original output column name
    one_label_list = [x for x in output.columns.values.tolist() if column in x]
    one_label_df = output[one_label_list]
    # inverse transform the dataframe for one label
    single_label = label_binarizer[column].inverse_transform(one_label_df.values)
    # join the single-label output to the full output dataframe
    inverse_df = inverse_df.join(pd.DataFrame(single_label, columns=[column]))
# drop the placeholder column
inverse_df.drop(labels=['place_holder'], axis=1, inplace=True)
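A hypothetical round-trip check tying the two snippets together, assuming both run in the same session so the binarized frame feeds straight back into the inverse loop:

# hypothetical glue: define the inputs of the inverse loop
output = binarize_df
output_cols = data.columns.values.tolist()
label_binarizer = lb_dict
# ...run the inverse loop above, then compare...
print(data.head())
print(inverse_df.head())  # should show the same values if the round trip worked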
The issue comes from the data (and, in this case, a bad use of the model). If you build a DataFrame from the output of your MultiLabelBinarizer, you will see that all the columns are sorted in ascending order. When you ask it to reconstruct, the model rebuilds each row by "scanning" the values in that sorted order.
So if you take row one, you have:
1000 - California - January
Now if you take the second one, you have:
750 - February - New York
And so on...
So your month gets swapped because of the sorting order. If you replaced the month with "ZFebruary" it would come out in the right place, but still only by "luck".
What you should do is train one model per categorical feature and stack the resulting matrices to get your final matrix. To revert it, you extract each sub-matrix and run inverse_transform on it.
To create one model per feature, you can refer to the answer of Napitupulu Jon in this SO question.
EDIT 1:
I tried the code from the SO question and it doesn't work because the number of columns changes. This is what I have now (but you still have to save the columns of every feature somewhere):
import pandas as pd
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer
from collections import defaultdict

data = {
    "State": ["California", "New York", "Alaska", "Arizona", "Alaska", "Arizona"],
    "Month": ["January", "February", "May", "February", "January", "February"],
    "Number": ["1000", "750", "500", "25000", "2000", "1"]
}
df = pd.DataFrame(data)

d = defaultdict(MultiLabelBinarizer)  # dict of feature => model
list_encoded = []                     # store the single matrices
for column in df:
    d[column].fit(df[column])
    list_encoded.append(d[column].transform(df[column]))
merged = np.hstack(list_encoded)      # matrix of 6 x 32
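To revert merged, a minimal sketch of the "extract each sub-matrix and inverse_transform it" step, assuming d, df, list_encoded and merged come from the snippet just above:

# slice merged back into per-feature sub-matrices and invert each one
decoded = {}
start = 0
for column, encoded in zip(df, list_encoded):
    width = encoded.shape[1]
    decoded[column] = d[column].inverse_transform(merged[:, start:start + width])
    start += width
# note: MultiLabelBinarizer treats each string as a collection of characters,
# so the decoded values here are tuples of characters, not the original strings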
I hope this helps and the explanation is clear enough,
Nicolas

Pyspark - create training set and testing set from dataframe

I have a dataframe like the photo below. I would like to create a training set and a testing set out of it. The dataset is ordered by CustomerID and InvoiceNo. For each customer, I would like to take every row except that customer's last two rows as the training set, while the last two rows of each customer become the testing set.
Ideally the result would be one giant training set and one testing set. Is there an efficient way to do that with PySpark? Thanks a lot for your help in advance.
You could always add an index and filter based on that index; I'm not sure there is anything more efficient than that.
from pyspark.sql.window import Window
from pyspark.sql import functions as func

window = Window.partitionBy(func.col("CustomerID")) \
               .orderBy(func.col("InvoiceNo").desc())
df = df.select('*', func.rank().over(window).alias('rank'))

train = df.filter("rank > 2")
test = df.filter("rank <= 2")
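One caveat: rank() assigns equal ranks to tied InvoiceNo values, so a tie could put more than two rows per customer in the test set. If that matters, a variant sketch using row_number(), which guarantees exactly two test rows per customer, and dropping the helper column afterwards:

from pyspark.sql import functions as func

df = df.select('*', func.row_number().over(window).alias('rn'))
train = df.filter("rn > 2").drop('rn')
test = df.filter("rn <= 2").drop('rn')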
