MultiLabelBinarizer mixes up data when inverse transforming - python

I am using sklearn's MultiLabelBinarizer() to binarize multiple columns of the data I use to train my model.
After using it, I noticed it was mixing up my data when it inverse transforms it. I created a test set of random values where I fit the data, transform it, and inverse_transform it to get back to the original data.
I ran a simple test in a Jupyter notebook to show the error:
In the inverse_transformed value it messes things up in row 1, mixing up the state and month.
[screenshot: Jupyter notebook code and output]
First of all, is there an error in how I use the MultiLabelBinarizer? Is there a different way to achieve the same output?
EDIT:
Thank you to @Nicolas M. for helping me solve my question. I ended up solving this issue like this.
Forgive the rough explanation, but it turned out to be more complicated than I originally thought. I switched to using the LabelBinarizer instead of the MultiLabelBinarizer because each of my columns holds a single label per row, which is what LabelBinarizer is designed for.
I ended up pickling the label_binarizer defaultdict so I can load it and use it in different modules for my machine learning project.
One thing that might not be obvious is that I add new headers to the dataframe I build for each column, of the form column_name + column_number. I did this because I needed to inverse transform the data later: I search for the columns whose names contain the original column name, which splits the larger dataframe back into the individual per-column chunks.
Here are some variables that I used and what they mean, for reference:
lb_dict - default dict that stores the different label binarizers.
binarize_df - dataframe that stores the binarized data.
binarized_label - the binarized data for one column.
header - new headers of the form: column name + column number.
inverse_df - dataframe that stores the inverse_transformed data.
one_label_list - list of column names that contain the original column tag.
one_label_df - dataframe that stores only the binarized data for one column.
single_label - binarized data that gets inverse_transformed into one column.
In this code, data is the dataframe that I pass to the function.
from collections import defaultdict
import pandas as pd
from sklearn.preprocessing import LabelBinarizer

lb_dict = defaultdict(LabelBinarizer)
# create a placeholder dataframe to join the new binarized data to
binarize_df = pd.DataFrame(['x'] * len(data.index), columns=['place_holder'])
# loop through each column, create a binarizer, fit/transform the data,
# and add the new data to the binarize_df dataframe
for column in data.columns.values.tolist():
    lb_dict[column].fit(data[column])
    binarized_label = lb_dict[column].transform(data[column])
    header = [column + str(i) for i in range(0, len(binarized_label[0]))]
    binarize_df = binarize_df.join(pd.DataFrame(binarized_label, columns=header))
# drop the placeholder column
binarize_df.drop(labels=['place_holder'], axis=1, inplace=True)
Here is the inverse_transform function that I wrote:
# output: the binarized dataframe; output_cols: the original column names;
# label_binarizer: the lb_dict loaded back from the pickle
inverse_df = pd.DataFrame(['x'] * len(output.index), columns=['place_holder'])
# run through the different output columns that need to be inverse_transformed
for column in output_cols:
    # list the headers whose names contain the original output column name
    one_label_list = [x for x in output.columns.values.tolist() if column in x]
    one_label_df = output[one_label_list]
    # inverse transform the dataframe for one label
    single_label = label_binarizer[column].inverse_transform(one_label_df.values)
    # join the output of the single-label df to the entire output df
    inverse_df = inverse_df.join(pd.DataFrame(single_label, columns=[column]))
inverse_df.drop(labels=['place_holder'], axis=1, inplace=True)
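For reuse in other modules, the pickling step mentioned above is short; a minimal sketch, assuming lb_dict is the fitted defaultdict from the code above and 'lb_dict.pkl' is a path of your choosing:

import pickle

# save the fitted binarizers
with open('lb_dict.pkl', 'wb') as f:
    pickle.dump(lb_dict, f)

# later, in a different module
with open('lb_dict.pkl', 'rb') as f:
    label_binarizer = pickle.load(f)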

The issue comes from the data (and in this case a bad use of the model). If you create a DataFrame from your MultiLabelBinarizer output, you will have:
[table: the binarized dataframe, one 0/1 column per unique value]
You can see that all the columns are sorted in ascending order. When you ask it to reconstruct, the model rebuilds each row by "scanning" the values in that column order.
So if you take line one, you have:
1000 - California - January
Now if you take the second one, you have:
750 - February - New York
And so on...
So your month and state are swapped because of the sorting order. If you replaced the month with "ZFebruary" it would come out right, but still only by "luck".
What you should do is train one binarizer per categorical feature and stack the resulting matrices to get your final matrix. To revert it, you extract each sub-matrix and call inverse_transform on it.
To create one model per feature, you can refer to Napitupulu Jon's answer in this SO question.
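To see the sorted ordering that causes the swap, you can inspect the classes_ attribute; a small sketch using the values from the example above (each row is passed as a list so the whole strings are treated as labels):

from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
mlb.fit([["1000", "California", "January"], ["750", "New York", "February"]])
print(mlb.classes_)
# ['1000' '750' 'California' 'February' 'January' 'New York']
# inverse_transform returns each row's labels in this global sorted order,
# so the column a value originally came from is lost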
EDIT 1:
I tried the code from the SO question and it doesn't work, as the number of columns changes. This is what I have now (but you still have to save the columns for every feature somewhere):
import pandas as pd
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer
from collections import defaultdict

data = {
    "State": ["California", "New York", "Alaska", "Arizona", "Alaska", "Arizona"],
    "Month": ["January", "February", "May", "February", "January", "February"],
    "Number": ["1000", "750", "500", "25000", "2000", "1"]
}
df = pd.DataFrame(data)
d = defaultdict(MultiLabelBinarizer)  # dict of feature => model
list_encoded = []                     # store the single matrices
for column in df:
    d[column].fit(df[column])
    list_encoded.append(d[column].transform(df[column]))
merged = np.hstack(list_encoded)      # matrix of 6 x 32
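To revert merged, one possible sketch is to keep the width of each sub-matrix and slice accordingly before calling each model's inverse_transform (note that fit() above saw plain strings, so scikit-learn treated each string as an iterable and the recovered labels are single characters):

# widths of the per-feature sub-matrices, in the same order as df.columns
widths = [m.shape[1] for m in list_encoded]
start = 0
for column, width in zip(df.columns, widths):
    sub_matrix = merged[:, start:start + width]
    # inverse_transform returns one tuple of labels per row
    print(column, d[column].inverse_transform(sub_matrix))
    start += width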
I hope this helps and the explanation is clear enough,
Nicolas

Related

Detect the two most frequent labels in the labels variable, the other records of the dataset will be eliminated

I am working with a dataset from which I need to remove some records based on a variable.
The dataset comes from the sklearn library:
from sklearn.datasets import fetch_kddcup99
I need to detect the two most frequent labels in the labels variable; the other records of the dataset will be eliminated.
datos = pd_data.groupby('labels').size().sort_values(ascending=False)
top = datos.head(2)
print(top)
I try to delete them this way but I can't delete them:
[screenshot: attempted filtering code]
When looking at the dataset, the other records are still there:
[screenshot: current output]
And I need:
[screenshot: desired output]
If I understand your question, you want to create a dataframe containing only those records containing the two most frequent labels.
Assuming you have a list of the desired labels a, you can filter the dataframe as follows:
a = ["b'neptune,'", "b'normal,'"]
dfout = df[df['labels'].isin(a)]
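If you would rather not hard-code the labels, you can derive the two most frequent ones directly from the dataframe; a small sketch, assuming df has a 'labels' column:

# take the two most frequent labels and keep only their rows
top_two = df['labels'].value_counts().head(2).index
dfout = df[df['labels'].isin(top_two)]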

Looping a function over huge dataset

An (already defined) function takes an ISIN (a unique identifier in finance) as input and returns the corresponding RIC (another identifier) by looking it up in an internal web app where this data is available in tabular form. The key limitation of this website is that it can't take more than 500 input IDs at a time. When 500 or fewer ISINs are entered as input, it returns a dataframe containing the input ISINs and their corresponding RIC codes.
The task is to take a CSV containing 30k ISINs as input, batch them into groups of 500 so they can pass through the function, and then store the produced output (a dataframe), looping over the input and appending the output incrementally.
Can someone please suggest how to break this data of 30k into chunks of 500, loop them through the function, and store all the results? Many thanks in advance!
.iloc is the method you want to use.
df.iloc[firstrow:lastrow, firstcol:lastcol]
If you put it in a for loop such as:
for x in range(0, 30000, 500):   # step of 500 to match the website's limit
    a = x          # first row of the batch
    b = x + 500    # last row of the batch (exclusive)
    thisDF = bigdf.iloc[a:b, firstcol:lastcol]
Try it and implement it in your code. You should post questions with some code you tried, so you get better help.
Assuming you read the .csv file in as a pandas Series (e.g., with something like pd.read_csv('ISINs.csv').squeeze('columns'), since pd.Series.from_csv has been removed from pandas) or that you have a list, you could split these up like this:
import pandas as pd
import numpy as np

# mock series of ISINs
isins = pd.Series(np.arange(0, 30002, 1))

data = pd.DataFrame()
for i in range(0, len(isins)//500):
    isins_for_function = isins.iloc[i*500: i*500+500]
    # if you have a list instead of a Series, slice it like this instead:
    # isins_for_function = isins[i*500: i*500+500]
    df = func(isins_for_function)
    data = pd.concat([data, df])
# handle the leftover rows after the last full chunk of 500
# for a Series
df = func(isins.iloc[-(len(isins) % 500):])
# for a list
# df = func(isins[-(len(isins) % 500):])
data = pd.concat([data, df])
This will concatenate your dataframes together into data. isins is your Series/list of ISINs. You need the last bit after the for loop for any values that come after the last full chunk of 500 (the Series above has 30002 rows, so the last two are not included in the chunks of 500 and still need to be passed to the function func).
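As a design note, a single loop with a step of 500 handles both the full chunks and the remainder, so no special case is needed after the loop; a hedged sketch, where func is the lookup function from the question and isins is a plain list (for a Series, use isins.iloc[start:start + size] instead):

import pandas as pd

def batch_lookup(isins, func, size=500):
    # range() advances by `size`; the final slice is simply shorter
    frames = [func(isins[start:start + size])
              for start in range(0, len(isins), size)]
    return pd.concat(frames, ignore_index=True)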

Concatenating and appending to a DataFrame inside the For Loop in Python

I have the following problem.
There is a quite big dataset with features and IDs. Due to the task definition, I'm doing clustering, but not on the whole dataset; instead, I take each of the IDs and train the model on the feature data from that particular ID. Here is how that looks in detail:
Imagine that we have our initial dataframe df_init.
Then I create the array with unique ID_s:
dd = df_init['ID'].unique()
After that, a dict comprehension is created just like this:
dds = {x: y for x, y in df_init.groupby('ID')}
Using a for loop and iterating over dds, I take the data and use it to train the clustering algorithm. After that, pd.concat() is used to get the dataframe back (for this example, only two IDs are shown):
import pandas as pd
from sklearn.cluster import AgglomerativeClustering

df = pd.DataFrame()
d = {}
n = 5
for i in dd[:2]:
    d[i] = dds[i].iloc[:, 1:5].values
    ac = AgglomerativeClustering(n_clusters=n, linkage='complete').fit(d[i])
    labels = ac.labels_
    labels = pd.DataFrame(labels)
    df = pd.concat([df, labels])
    print(i)
    print('Labels: ', labels)
So the result of this loop will be the following output:
[printed output: each ID followed by its labels]
And the output df will look like this (shown only for the first ID; the rest of the labels are also there):
[screenshot: df with a single column of cluster labels]
My question is the following: how can I add an additional column to this dataframe inside the loop that matches each ID to its corresponding labels (4 labels for ID_1, another 4 labels for ID_2, etc.)? Is there a pandas solution for achieving that?
Many thanks in advance!
Below this line:
labels = pd.DataFrame(labels)
Add the following:
labels['ID'] = i
This will give you the extra column with the proper ID for each subset.
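In context, the end of the loop body would then look something like this (ignore_index=True is an optional addition so the final dataframe gets a clean running index):

    labels = pd.DataFrame(ac.labels_, columns=['label'])
    labels['ID'] = i  # tag every row of this subset with its ID
    df = pd.concat([df, labels], ignore_index=True)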

Only a fraction of the dataframe is merged in pandas - python

My problem is simple. I have a pandas dataframe with 124957 different tweets (related to a central topic). The problem is that each date has more than one tweet (around 300 per day).
My goal is to perform sentiment analysis on the tweets of each day. In order to solve this, I am trying to combine all tweets of the same day into one string (which corresponds to that date).
To achieve this, I have tried the following:
indx = 0
get_tweet = ""
for i in range(0, len(cdata)-1):
    get_date = cdata.date.iloc[i]
    next_date = cdata.date.iloc[i+1]
    if str(get_date) == str(next_date):
        get_tweet = get_tweet + cdata.text.iloc[i] + " "
    if str(get_date) != str(next_date):
        cdata.loc[indx, 'date'] = get_date
        cdata.loc[indx, 'text'] = get_tweet
        indx = indx + 1
        get_tweet = " "
df.to_csv("/home/development-pc/Documents/BTC_Tweets_1Y.csv")
My problem is that only a small sample of the data is actually converted to my format of choice.
[image of the dataframe]
I do not know whether it matters, but the dataframe consists of three separate datasets that I combined into one using pd.concat. After that, I sorted the newly created dataframe by date (ascending order) and reset the index, as it was reversed (last input (2020-01-03) = 0 and first input (2019-01-01) = 124958).
Thanks in advance,
Filippos
Without going into the loop you used (I think you are only concatenating the first two instances, but I'm not sure), you could use groupby and apply; here is an example:
# create some random data for the example
import pandas as pd
import random
import string

dates = random.choices(pd.date_range(pd.Timestamp(2020, 1, 1), pd.Timestamp(2020, 1, 6)), k=11)
letters = string.ascii_lowercase
texts = [' '.join([''.join(random.choices(letters, k=random.randrange(2, 10)))
                   for x in range(random.randrange(3, 12))]) for x in range(11)]
df = pd.DataFrame({'date': dates, 'text': texts})
# group
pd.DataFrame(df.groupby('date').apply(lambda g: ' '.join(g['text'])))
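Applied to the question's dataframe, the same idea would look something like this (assuming cdata has 'date' and 'text' columns as in the question):

daily = cdata.groupby('date')['text'].apply(' '.join).reset_index()
daily.to_csv("/home/development-pc/Documents/BTC_Tweets_1Y.csv", index=False)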

Joining List to a Pandas Frame - Have I kept the order?

So I have 2 scripts for an artificial neural network on insurance claims: one script to train/test and one to execute going forward. I am done with the first one and am developing the second, using real production data as a test of it. The target/class label is a binary 1 or 0. The input data starts in a dataframe of shape (5914, 23), all numeric. I then call df.values.tolist() on it, apply StandardScaler() to all values (other than the first one, which is a claim ID), and in the process it goes through np.asarray. I then run it through ann_model.predict_proba, which gives me a list of 5,914 pairs of probabilities. Now I want to merge the probabilities (called "predicted_probs") back into the dataframe I had before the tolist(), as a new column called "Results", and only for one class (I am only interested in the positive class). I do so via the following code, but I don't know if the order of my results is the same as the order of the dataframe. Is it?
for i in range(0, len(predicted_probs)):
    original_df["Results"] = pd.Series(predicted_probs[i])
    print(predicted_probs[[i], [1]])
Should I be doing it another way? I have to replicate what is done in the training script in order to expect like-for-like results, hence the StandardScaler(), np.asarray etc.
Thanks in advance
Your dataframe's shape is (5914, 23) and the output from ann_model.predict_proba has 5914 rows. Since each row of your df yields a single pair of probabilities, you can expect the order of your results to match the order of your dataframe. To add the probability of the positive class to the dataframe:
original_df['Results'] = [i[1] for i in predicted_probs]
There is no need to loop through predicted_probs.
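Equivalently, if predicted_probs is (or is converted to) a NumPy array, a column slice does the same job in one step; a small sketch under that assumption:

import numpy as np

# column 1 holds the positive-class probability for every row, in row order
original_df['Results'] = np.asarray(predicted_probs)[:, 1]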
