Duplicate the samples in a dataset? - python

I used some code to check my dataset 'df' and found a serious class imbalance in the column 'Has_Arrears'. I would like to expand my dataset by duplicating the samples where Has_Arrears = 1 35 times, i.e. repeat each observation with Has_Arrears = 1 35 times. How can I achieve this? Cheers
If I wanted to use stratified sampling instead, how would I code that?

If I understand you correctly, this may be what you're looking for:
mask = df['Has_Arrears'] == 1  # select the minority-class rows
minority = df[mask]
# append 35 extra copies of those rows (pd.concat, since df.append was removed in pandas 2.0)
df = pd.concat([df] + [minority] * 35, ignore_index=True)
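As for the stratified-sampling follow-up, here is a minimal sketch, assuming what you want is a train/test split that preserves the Has_Arrears class proportions (the stratify parameter of scikit-learn's train_test_split does exactly that):

from sklearn.model_selection import train_test_split

# Both splits keep the original Has_Arrears class proportions.
train_df, test_df = train_test_split(
    df, test_size=0.2, stratify=df['Has_Arrears'], random_state=42
)

# Pandas-only alternative: draw a 20% sample from each class.
sampled = df.groupby('Has_Arrears', group_keys=False).sample(frac=0.2, random_state=42)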


How to retrieve rows based on mismatched condition on particular columns?

I need to do the following task.
I have 9 columns along with the original label. Each of those 9 columns holds a probability value; each group of 3 values is the prediction of one particular model. I have a total of 3 classifier models and there are 3 classes.
Now I have to apply the max rule.
For each class I pick the max probability, which gives me three max values. Finally I return the class whose value is the largest among those 3.
My code and sample:
import numpy as np

# The class-named columns appear once per model, so for each class take the
# row-wise max over all columns carrying that class's name:
df['Covid_max'] = np.where(df.columns == 'Covid', df.values, 0).max(axis=1)
df['Normal_max'] = np.where(df.columns == 'Normal', df.values, 0).max(axis=1)
df['Pneumonia_max'] = np.where(df.columns == 'Pneumonia', df.values, 0).max(axis=1)
# The predicted class is the one with the largest of the three max values:
df['pred'] = df[['Covid_max', 'Normal_max', 'Pneumonia_max']].idxmax(axis=1)
new_label = {"pred": {"Covid_max": 0, "Normal_max": 1, "Pneumonia_max": 2}}
df.replace(new_label, inplace=True)
Up to here I have it working. Now I am stuck: I only need the records where the Class and pred columns mismatch (here it should only print the 2nd row). How do I do that?
Also, if anybody has another solution, I would be happy to learn it.
TIA
Try this.
df_mismatch = df.loc[~(df['Class'] == df['pred'])]  # keep only the rows where Class and pred differ
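Since you asked for another solution, here is a sketch that computes the max rule without the np.where trick, assuming the nine probability columns repeat the three class names (one copy per model) and the true label sits in 'Class':

# Selecting a list of duplicated labels pulls in all matching columns.
probs = df[['Covid', 'Normal', 'Pneumonia']]
# Transpose, group the rows by class name, take the max, transpose back:
max_per_class = probs.T.groupby(level=0).max().T
# Same label encoding as above, then filter the mismatches directly.
df['pred'] = max_per_class.idxmax(axis=1).map({'Covid': 0, 'Normal': 1, 'Pneumonia': 2})
mismatches = df[df['Class'] != df['pred']]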

How to slice data with groupby() function?

I am doing an ML project. After preprocessing the data I need to do feature extraction. In my dataset I have 25 classes (the alphabets in the dataset) and there are 20 subjects (how many times I recorded the alphabet) for each class. After groupby() they (25*20 = 500 groups) all have the same size (1000). I want to compress 1000 sampling points into 50 sampling points by calculating means over the maccs column.
My dataset has 'alphabet', 'subject', and 'maccs' columns, among others.
This is what I tried, but it did not work. It gives a 'SeriesGroupBy' object has no attribute 'iloc' error.
for i in np.arange(211, 890, 20):
    new_dataset = new_dataset.groupby(['alphabet', 'subject'])['maccs'].iloc[i-10:i+20, 6].mean(axis=0)
How can I access rows and columns while using groupby()? Or what else can I use to do something similar?
import pandas as pd

alpha_df = pd.read_csv(...)  # path to your .csv file
alpha_gb = alpha_df.groupby(['alphabet', 'subject'])
alpha_agg = alpha_gb.agg({
    'maccs': 'mean'
})
agg_alpha_df = alpha_agg.reset_index()
Here I assume you want to categorize first by the alphabet column and then by the subject column, because the order of the column names in groupby() matters.
By the way, this can be done in a single line:
grouped_df = alpha_df.groupby(['alphabet', 'subject'])['maccs'].mean().reset_index()
But the first version is more explicit and adjustable.
You can look at the pandas aggregation documentation for more aggregate operations.
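Note that this gives one mean per (alphabet, subject) group, i.e. 500 values. If the goal is really to compress each group's 1000 points into 50, here is a sketch of that chunked averaging, assuming each group holds 1000 ordered rows:

import numpy as np

def chunk_means(group, chunk=20):
    # bin the rows positionally: 0..19 -> bin 0, 20..39 -> bin 1, ... (50 bins)
    bins = np.arange(len(group)) // chunk
    return group['maccs'].groupby(bins).mean()  # one mean per 20-sample bin

compressed = alpha_df.groupby(['alphabet', 'subject']).apply(chunk_means)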

Efficient way of vertically expanding a Dataframe by row and keeping the same values?

I am doing this educational challenge on Kaggle: https://www.kaggle.com/c/competitive-data-science-predict-future-sales
The training set is a file of daily sales numbers for some products, and the test set we need to predict is the sales of similar items for the month of November.
Now I would like to use my model to make daily predictions, and thus expand the test data set by 30 rows for each of its rows.
I have the following code:
for row in test.itertuples():
    df = pd.DataFrame(index=nov15, columns=test.columns)
    df['shop_id'] = row.shop_id
    df['item_category_id'] = row.item_category_id
    df['item_price'] = row.item_price
    df['item_id'] = row.item_id
    df = df.reset_index()
    df.columns = ['date', 'item_id', 'shop_id', 'item_category_id', 'item_price']
    df = df[train.columns]
    tt = pd.concat([tt, df])
nov15 is a pandas date range from 1 Nov 2015 to 30 Nov 2015.
tt is just an empty dataframe that I fill by expanding it by 30 rows (Nov 1 to 30) for every row in the test set.
test is the original dataframe I am copying the rows from.
It runs, but it takes hours to complete.
Knowing pandas and learning from previous experiences, there is probably a more efficient way to do this.
Thank you for your help!
So I had found a "more" efficient way, and then someone over at Reddit's r/learnpython told me about the correct and most efficient way.
The dilemma above is easily solved by the pandas explode function.
These two lines do what I did above, but within seconds:
test['date'] = [nov15 for _ in range(len(test))]
test = test.explode('date')
My own "more efficient" second solution, which is in no way close to equivalent or as good, was simply to make 30 copies of the dataframe, each with a 'date' column added.
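For reference, a minimal self-contained version of the explode trick, using a hypothetical two-row test frame in place of the real one:

import pandas as pd

nov15 = pd.date_range('2015-11-01', '2015-11-30')  # the date range described above

test = pd.DataFrame({'shop_id': [1, 2], 'item_id': [10, 20],
                     'item_category_id': [5, 6], 'item_price': [9.99, 4.99]})

test['date'] = [nov15 for _ in range(len(test))]  # every row gets all 30 dates
test = test.explode('date')                       # one row per (row, date) pair
print(len(test))                                  # 60: each row expanded 30-fold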

Assignment with both fillna() and loc() apparently not working

I've searched around for answers, but I cannot find any.
My goal: I'm trying to fill some missing values in a DataFrame, using supervised learning to decide how to fill them.
My code looks like this (note: this first part is not important, it is just to give context):
from sklearn import neighbors
import pandas as pd

train_df = df[df['my_column'].notna()]  # I need to train the model without the missing data
train_x = train_df[['lat', 'long']]     # lat and long are the inputs
train_y = train_df[['my_column']]       # my_column is the output
clf = neighbors.KNeighborsClassifier(2)
clf.fit(train_x, train_y)               # clf is the classifier; here we train it
df_x = df[['lat', 'long']]              # I need this part to do the prediction
prediction = clf.predict(df_x)          # clf.predict() returns an array
series_pred = pd.Series(prediction)     # now the array is a Series
print(series_pred.shape)                # returns (2381,)
print(series_pred.isna().sum())         # returns 0
So far, so good. I have my 2381 predictions (I only need a few of them) and there are no NaN values among them (why would there be NaN values in the predictions? I just wanted to be sure, as I don't understand my error).
Here I try to assign the predictions to my Dataframe:
#test_1
df.loc[df['my_column'].isna(), 'my_column'] = series_pred  # assign the predictions using .loc()
#test_2
df['my_column'] = df['my_column'].fillna(series_pred)  # double check: assign the predictions using .fillna()
print(df['my_column'].shape)         # returns (2381,)
print(df['my_column'].isna().sum())  # returns 6
As you can see, it didn't work: there are still 6 missing values. I randomly tried a slightly different approach:
#test_3
df[['my_column']] = df[['my_column']].fillna(series_pred)  # will it work?
print(df[['my_column']].shape)         # returns (2381, 1)
print(df[['my_column']].isna().sum())  # returns 6
That did not work either. I decided to try one last thing: check the fillna() result even before assigning it back to the original df:
In[42]:
print(df['my_column'].fillna(series_pred).isna().sum())  # extreme test
Out[42]:
6
So... where is my very, very stupid mistake? Thanks a lot.
EDIT 1
To show a little bit of the data:
In[1]:
df.head()
Out[1]:
    my_column  lat  long
id
9df       Wil   51     5
4f3     Fabio   47     9
x32     Fabio   47     8
z6f     Fabio   47     9
a6f  Giovanni   47     7
I've also added some info at the beginning of the question.
@Ben.T and @Dan should post their own answers; they deserve to be accepted as the correct ones.
Following their hints, I would say that there are two solutions:
Solution 1 (Best): Use loc()
The problem
The problem with the current solution is that df.loc[df['my_column'].isna(), 'my_column'] expects to receive X values, where X is the number of missing values. My variable prediction actually holds predictions for both the missing and the non-missing values.
The solution
pred_df = df[df['my_column'].isna()]  # predict only on the rows with missing values; problem solved
df_x = pred_df[['lat', 'long']]
prediction = clf.predict(df_x)
df.loc[df['my_column'].isna(), 'my_column'] = prediction
Solution 2: Use fillna()
The problem
The problem with the current solution is that df['my_column'].fillna(series_pred) requires the indexes of my df to match those of series_pred, which is impossible in this situation unless your df has a simple index like [0, 1, 2, 3, 4, ...].
The solution
Reset the index of the df at the very beginning of the code.
Why is this not the best
The cleanest way is to compute the prediction only where you need it. That is easy to achieve with loc(), and I do not know how you would obtain it with fillna(), because you would need to preserve the index through the classification.
Edit: setting series_pred.index = df['my_column'].isna().index also works. Thanks @Dan
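Concretely, that edit amounts to building the prediction Series on df's own index so that fillna() can align the values. A sketch, reusing the clf and df from the question:

series_pred = pd.Series(clf.predict(df[['lat', 'long']]), index=df.index)
df['my_column'] = df['my_column'].fillna(series_pred)  # now the indexes align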

How can I efficiently get the first x% of a DataFrame?

Say I have a DataFrame with 1000 rows. If I wish to create a Series of only the first 5% (i.e. the first 50 rows), what is the best way to do this in terms of percentages? (I don't want to simply do df.head(50).)
I would like the code to be able to adapt if I wanted to change x to, say, 20% or 30%.
This should work:
your_percentage = 5  # or 20, 30, etc.
df = df.iloc[:round(len(df) / 100 * your_percentage)]
All you need to do is calculate the number of rows from the percentage before you call .head().
Example:
percentage = 20
rows_to_keep = round(percentage / 100 * len(df))
df = df.head(rows_to_keep)
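To make it adapt as the question asks, one could wrap the same arithmetic in a small helper (a hypothetical function name, just for illustration):

def first_pct(df, pct):
    # return the first pct% of the DataFrame's rows
    return df.head(round(len(df) * pct / 100))

top_20 = first_pct(df, 20)  # first 20% of the rows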
