I have a dataframe like the photo below. I would like to create a training and testing set out of it. The dataset is ordered by CustomerID and InvoiceNo. For each customer, I would like to take every row except the last 2 rows of that customer as training set, while the second to the last row of each customer would become a training set.
The result would be ideally 1 giant training set and 1 testing set. Is there an efficient way to do that with PySpark? Thanks a lot for your help in advance
You could always add an index and filter based off of that index--not sure if there's anything more efficient than that.
from pyspark.sql.window import Window
from pyspark.sql import functions as func
window = Window.partitionBy(func.col("CustomerID"))\
.orderBy(func.col("InvoiceNo").desc())
df = df.select('*', func.rank().over(window).alias('rank'))
train = df.filter("rank > 2")
test = df.filter("rank <= 2")
Related
I am working with a data set, from which I need to remove some records from a variable.
The datasets a is from the sklearn library:
from sklearn.datasets import fetch_kddcup99
Detect the two most frequent labels in the labels variable, the other records of the dataset will be eliminated.
datos = pd_data.groupby('labels').size().sort_values(ascending=False)
top = datos.head(2)
print(top)
I try to delete them this way but I can't delete them:
When looking at the dataset the other records still follow:
And I need:
If I understand your question, you want to create a dataframe containing only those records containing the two most frequent labels.
Assuming you have a list of the desired labels a you can filter the dataframe as follows:
a = ["b'neptune,'", "b'normal,'"]
dfout = df['labels].isin(a)
I have the following problem.
There is quite big dataset with the features and IDs. Due to the task definition, I'm trying to do clustering but not for all dataset, instead of that I'm taking each of the IDs and then train the model on the feature data from this particular ID. How does that look in details:
Imagine, that we have our initial dataframe df_init
Then I create the array with unique ID_s:
dd = df_init['ID'].unique()
After that, set comprehension is being created just like that:
dds = {x:y for x,y in df_init.groupby('ID')}
Using for loops and iterating over dds, I'm taking the data and use it for training the clustering algorithm. After that, pd.concat() is using to get the dataframe back(for this example, will show only two IDs):
df = pd.DataFrame()
d={}
n=5
for i in dd[:2]:
d[i] = dds[i].iloc[: , 1:5].values
ac = AgglomerativeClustering(n_clusters=n, linkage='complete').fit(d[i])
labels = ac.labels_
labels = pd.DataFrame(labels)
df = pd.concat([df, labels])
print(i)
print('Labels: ', labels)
So the result for this loop will be following output:
And the output df will look like that(shown only for first ID, the rest labels are also there):
My question is the following: how can I add the additional column to this dataframe in the loop, that will be matching certain ID to corresponding labels (4 labels-ID_1, another 4 labels-ID_2, etc.)? Are there any pandas solution for achieving that?
Many thanks in advance!
Below this line:
labels = pd.DataFrame(labels)
Add the following:
labels['ID']=i
This will give you the extra column with the proper ID for each subset
I have a code that works with an excel file (SAP Download) quite extensively (data transformation and calculation steps).
I need to loop through all the lines (couple thousand rows) a few times. I have written a code prior that adds DataFrame columns separately, so I could do everything in one for loop that was of course quite quick, however, I had to change data source that meant change in raw data structure.
The raw data structure has 1st 3 rows empty, then a Title row comes with column names, then 2 rows empty, and the 1st column is also empty. I decided to wipe these, and assign column names and make them headers (steps below), however, since then, separately adding column names and later calculating everything in one for statement does not fill data to any of these specific columns.
How could i optimize this code?
I have deleted some calculation steps since they are quite long and make code part even less readable
#This function adds new column to the dataframe
def NewColdfConverter(*args):
for i in args:
dfConverter[i] = '' #previously used dfConverter[i] = NaN
#This function creates dataframe from excel file
def DataFrameCreator(path,sheetname):
excelFile = pd.ExcelFile(path)
global readExcel
readExcel = pd.read_excel(excelFile,sheet_name=sheetname)
#calling my function to create dataframe
DataFrameCreator(filePath,sheetName)
dfConverter = pd.DataFrame(readExcel)
#dropping NA values from Orders column (right now called Unnamed)
dfConverter.dropna(subset=['Unnamed: 1'], inplace=True)
#dropping rows and deleting other unnecessary columns
dfConverter.drop(dfConverter.head(1).index, inplace=True)
dfConverter.drop(dfConverter.columns[[0,11,12,13,17,22,23,48]], axis = 1,inplace = True)
#renaming columns from Unnamed 1: etc to proper names
dfConverter = dfConverter.rename(columns={Unnamed 1:propername1 Unnamed 2:propername2 etc.})
#calling new column function -> this Day column appears in the 1st for loop
NewColdfConverter("Day")
#example for loop that worked prior, but not working since new dataset and new header/column steps added:
for i in range(len(dfConverter)):
#Day column-> floor Entry Date -1, if time is less than 5:00:00
if(dfConverter['Time'][i] <= time(hour=5,minute=0,second=0)):
dfConverter['Day'][i] = pd.to_datetime(dfConverter['Entry Date'][i])-timedelta(days=1)
else:
dfConverter['Day'][i] = pd.to_datetime(dfConverter['Entry Date'][i])
Problem is, there are many columns that build on one another, so I cannot get them in one for loop, for instance in below example I need to calculate reqsWoSetUpValue, so I can calculate requirementsValue, so I can calculate otherReqsValue, but I'm not able to do this within 1 for loop by assigning the values to the dataframecolumn[i] row, because the value will just be missing, like nothing happened.
(dfsorted is the same as dfConverter, but a sorted version of it)
#example code of getting reqsWoSetUpValue
for i in range(len(dfSorted)):
reqsWoSetUpValue[i] = #calculationsteps...
#inserting column with value
dfSorted.insert(49,'Reqs wo SetUp',reqsWoSetUpValue)
#getting requirements value with previously calculated Reqs wo SetUp column
for i in range(len(dfSorted)):
requirementsValue[i] = #calc
dfSorted.insert(50,'Requirements',requirementsValue)
#Calculating Other Reqs value with previously calculated Requirements column.
for i in range(len(dfSorted)):
otherReqsValue[i] = #calc
dfSorted.insert(51,'Other Reqs',otherReqsValue)
Anyone have a clue, why I cannot do this in 1 for loop anymore by 1st adding all columns by the function, like:
NewColdfConverter('Reqs wo setup','Requirements','Other reqs')
#then in 1 for loop:
for i in range(len(dfsorted)):
dfSorted['Reqs wo setup'] = #calculationsteps
dfSorted['Requirements'] = #calculationsteps
dfSorted['Other reqs'] = #calculationsteps
Thank you
General comment: How to identify bottlenecks
To get started, you should try to identify which parts of the code are slow.
Method 1: time code sections using the time package
Wrap blocks of code in statements like this:
import time
t = time.time()
# do something
print("time elapsed: {:.1f} seconds".format(time.time() - t))
Method 2: use a profiler
E.g. Spyder has a built-in profiler. This allows you to check which operations are most time consuming.
Vectorize your operations
Your code will be orders of magnitude faster if you vectorize your operations. It looks like your loops are all avoidable.
For example, rather than calling pd.to_datetime on every row separately, you should call it on the entire column at once
# slow (don't do this):
for i in range(len(dfConverter)):
dfConverter['Day'][i] = pd.to_datetime(dfConverter['Entry Date'][i])
# fast (do this instead):
dfConverter['Day'] = pd.to_datetime(dfConverter['Entry Date'])
If you want to perform an operation on a subset of rows, you can also do this in a vectorized operation by using loc:
mask = dfConverter['Time'] <= time(hour=5,minute=0,second=0)
dfConverter.loc[mask,'Day'] = pd.to_datetime(dfConverter.loc[mask,'Entry Date']) - timedelta(days=1)
Not sure this would improve performance, but you could calculate the dependent columns at the same time row by row with DataFrame.iterrows()
for index, data in dfSorted.iterrows():
dfSorted['Reqs wo setup'][index] = #calculationsteps
dfSorted['Requirements'][index] = #calculationsteps
dfSorted['Other reqs'][index] = #calculationsteps
So I have 2 scripts for an Artificial Neural Network on insurance claims - one script is to train/test and one to execute going forward. I am done with the first one and developing the second one using real production data as a test of it. The target/class label is a binary 1 or 0. Input data is initially in a dataframe of shape (5914, 23) and it is all numeric data. I then do a df.values.tolist() on it, I do StandardScaler() on all values (other than the first one which is a Claim ID) and in the process, it goes through np.asarray. I then run it through ANN_Model.Predict_Proba which gives me a list of 5,914 pairs of probabilities. Now I want to merge back to the dataframe which I had before I did the tolist(), all of the probabilities (called "predicted_probs") and to do so into a new column on that original dataframe (column called "Results") and to do so for one class (I am only interested in the positive class). I do so via the following code. But I don't know if the order of my results is the same as the order of the dataframe. Is it?
for i in range (0,len(predicted_probs)):
original_df["Results"] = pd.Series(predicted_probs[i])
print (predicted_probs[[i],[1]])
Should I be doing it another way? I have to replicate what is done in the training script in order to expect like-for-like results, hence the StandardScaler(), np.asarray etc.
Thanks in advance
Your dataframe's shape is (5914, 23) and the output from ann_model.predict_proba is 5914. Since a row from your df will output a single probability you can expect that the order of your results is the same as the order of your dataframe. To add the probability of the positive class to the dataframe,
original_df['Results'] = [i[1] for i in predicted_probs]
There is no need for you to loop through the predicted_probs
I am using sklearn's multilabelbinarizer() to train multiple columns in my machine learning which I use to train my model.
After using it I noticed it was mixing up my data when it inverse transforms it. I created a test set of random values where I fit the data, transform it, and inverse_transform the data to get back to the original data.
I ran a simple test in jupyter notebook to show the error:
In the inverse_transformed value it messes up in row 1 mixing up the state and month.
jupyter notebook code
First of all, is there an error in how I use the multilabelbinarizer? Is there a different way to achieve the same output?
EDIT:
Thank you to #Nicolas M. for helping me solve my question. I ended up solving this issue like this.
Forgive the rough explanation, but it turned out to be more complicated than I originally thought. I switched to using the label_binarizer instead of the multi_label_binarizer because it
I ended up pickling the label_binarizer defaultdict so I can load it and use it in different modules for my machine learning project.
One thing that might not be trivial is me adding new headers to dataframe I make for each column. It was in the form of column_name + column number. I did this because I needed to inverse transform the data. To do that I searched for the columns that contained the original column name which separated the larger dataframe into the individual column chunks.
here some variables that I used and what they mean for reference:
lb_dict - default dict that stores the different label binarizers.
binarize_df - dataframe that stores the binarized data.
binarized_label - label binarizes one label in the column.
header - creates a new header form: column name + number column.
inverse_df - dataframe that stores the inverse_transformed data.
one_label_list - finds the list of column names with the original column tag.
one_label_df - creates a new data frame that only stores the binarized data for one column.
single_label - binarized data that gets inverse_transformed into one column.
in this code data is the dataframe that I pass to the function.
lb_dict = defaultdict(LabelBinarizer)
# create a place holder dataframe to join new binarized data to
binarize_df = pd.DataFrame(['x'] * len(data.index), columns=['place_holder'])
# loop through each column and create a binarizer and fit/transform the data
# add new data to the binarize_df dataframe
for column in data.columns.values.tolist():
lb_dict[column].fit(data[column])
binarized_label = lb_dict[column].transform(data[column])
header = [column + str(i) for i in range(0, len(binarized_label[0]))]
binarize_df = binarize_df.join(pd.DataFrame(binarized_label, columns=header))
# drop the place holder value
binarize_df.drop(labels=['place_holder'], axis=1, inplace=True)
Here is the inverse_transform function that I wrote:
inverse_df = pd.DataFrame(['x'] * len(output.index), columns=['place_holder'])
# use a for loop to run through the different output columns that need to be inverse_transformed
for column in output_cols:
# create a list of the different headers based on if the name contains the original output column name
one_label_list = [x for x in output.columns.values.tolist() if column in x]
one_label_df = output[one_label_list]
# inverse transform the data frame for one label
single_label = label_binarizer[column].inverse_transform(one_label_df.values)
# join the output of the single label df to the entire output df
inverse_df = inverse_df.join(pd.DataFrame(single_label, columns=[column]))
inverse_df.drop(labels=['place_holder'], axis=1, inplace=True)
The issue comes from the data (and in this case a bad use of the model). If you create a Dataframe of your MultiLabelBinarizer you will have :
You can see that all columns are sorted in ascending order. When you ask to reconstruct, the model will reconstruct it by "scanning" values by row.
So if you take the line one, you have :
1000 - California - January
Now if you take the second one, you have :
750 - February - New York
And so on...
So your month is swapped because of sorting order. If you replace the month by "ZFebrury", it's gonna be OK but still only by "luck"
What you should do is train 1 model per categorical feature and stack every matrix to have your final matrix. To revert it, you should extract both "sub_matrix" and do the inverse_transform.
To create 1 model per feature, you can refer to the answer of Napitupulu Jon in this SO question
EDIT 1:
I tried the code from the SO question and it doesn't work as the number of columns changed. This is what I have now (but you still have to save somewhere the column for every features)
import pandas as pd
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer
from collections import defaultdict
data = {
"State" : ["California", "New York", "Alaska", "Arizona", "Alaska", "Arizona"],
"Month" : ["January", "February", "May", "February", "January", "February" ],
"Number" : ["1000", "750", "500", "25000", "2000", "1"]
}
df = pd.DataFrame(data)
d = defaultdict(MultiLabelBinarizer) # dict of Features => model
list_encoded = [] # store single matrices
for column in df:
d[column].fit(df[column])
list_encoded.append(d[column].transform(df[column]))
merged = np.hstack(list_encoded) # matrix of 6 x 32
I hope it helps and the explaination is clear enough,
Nicolas