Iterate over pandas column and create new column - python

I have a pandas dataset made up of x batches (batch sizes, i.e. row counts, differ), and I create a new feature for each batch from that batch's own data.
I want to automate this process: first create a new column, then iterate over the batch id column while the batch id stays the same, compute the new feature values and write them into the new column, then continue with the next batch.
Here is the code for the manual method for a single batch:
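import numpy as np
from sklearn.neighbors import BallTree
# Select one batch; .copy() avoids pandas' SettingWithCopyWarning on the assignment below
batch = samples.loc[samples['batch id'] == 'XX'].copy()
tree = BallTree(red_points[['col1','col2']], leaf_size=15, metric='minkowski')
# Distances to (and indices of) the 2 nearest red points for every row of the batch
distance, index = tree.query(batch[['col1','col2']], k=2)
batch_size = batch.shape[0]
# col3 selects, per row, which of the two neighbour distances to keep
batch['new feature'] = distance[np.arange(batch_size), batch.col3]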

Since your batches are identified by their batch id, you can iterate over all the unique batch ids and fill in the "new feature" column only for the rows of the batch currently being processed.
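# First create an empty column
samples["new feature"] = np.nan
# Iterate through all unique batch ids
for batch_id in samples["batch id"].unique():
    mask = samples["batch id"] == batch_id
    batch = samples.loc[mask]
    # Do your computations on `batch`, e.g. the BallTree query from the question
    # (`tree` is the BallTree built there)
    distance, index = tree.query(batch[["col1", "col2"]], k=2)
    # Write the computed values back into the rows of this batch only
    samples.loc[mask, "new feature"] = distance[np.arange(len(batch)), batch["col3"]]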
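The same per-batch loop can also be written with groupby, which hands over each batch together with its row index. The following is only a minimal sketch on made-up data: the toy values and the batch_feature helper (distance of each row to its batch centroid) are assumptions for illustration, standing in for whatever computation the real feature needs.

```python
import numpy as np
import pandas as pd

# Toy data standing in for `samples`; the values are made up for illustration.
samples = pd.DataFrame({
    "batch id": ["A", "A", "B", "B", "B"],
    "col1": [0.0, 2.0, 1.0, 1.0, 4.0],
    "col2": [0.0, 0.0, 1.0, 5.0, 1.0],
})

def batch_feature(batch: pd.DataFrame) -> np.ndarray:
    """Hypothetical per-batch computation: distance of each row to the batch centroid."""
    points = batch[["col1", "col2"]].to_numpy()
    return np.linalg.norm(points - points.mean(axis=0), axis=1)

# Create the empty column once, then fill it batch by batch.
samples["new feature"] = np.nan
for batch_id, batch in samples.groupby("batch id"):
    # batch.index holds the row labels of this batch in the full frame
    samples.loc[batch.index, "new feature"] = batch_feature(batch)
```

Because each batch is written back through its own index labels, batches of different sizes need no special handling.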

Related

My dataset is destroyed after preprocessing data. What should I do?

My dataset is destroyed after preprocessing. I was finding the unique values of each column and creating a data frame from them, but the result doesn't show the names of my columns. Why?
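datasetnew1 = []  # collects the unique values (initialisation was missing in the snippet)
for j in range(X.shape[1]):  # loop over the columns, starting with the first
    m = np.unique(X.iloc[:, j])  # unique values of column j
    if len(m) > 10:  # minimum number of unique values the user wants
        datasetnew1.extend(m)  # appends the values to one flat list; the column name is not kept
datasetnew1 = np.array(datasetnew1, dtype='object').transpose()
datasetnew1 = pd.DataFrame(datasetnew1)  # build the DataFrame (no column names are passed)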
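One possible way to keep the column names is to collect the unique values in a dictionary keyed by column name instead of one flat list. This is a hedged sketch, not a confirmed fix for the original data: the toy frame X and the threshold min_unique are assumptions chosen so the example stays small (the question uses 10).

```python
import pandas as pd

# Toy frame standing in for X; the values are made up for illustration.
X = pd.DataFrame({
    "city": ["Oslo", "Rome", "Oslo", "Lima"],
    "flag": [0, 1, 0, 1],
    "score": [3, 7, 8, 9],
})
min_unique = 2  # hypothetical threshold; the question uses 10

# Map each qualifying column name to a Series of its unique values.
unique_cols = {
    name: pd.Series(X[name].unique())
    for name in X.columns
    if X[name].nunique() > min_unique
}

# Building the frame from a dict keeps the keys as column names;
# shorter columns are padded with NaN.
datasetnew1 = pd.DataFrame(unique_cols)
```

Note that Series.unique keeps values in order of first appearance, whereas np.unique returns them sorted.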

Split data into equal chunks. The last entry of 1st chunk shouldn't match the first entry of next

I have a dataset of IDs which have multiple entries of data. I want to split this dataset into chunks of a fixed size, but I need all data entries pertaining to an ID to be in the same chunk.
I tried ordering by ID, assigning an index to every row and dividing it into chunks, but I can't figure out how to keep all the entries with the same ID in the same chunk.
Here I took a sample dataframe and split it into separate chunks.
To keep rows with the same Id together, I store the number of occurrences of each Id, ordered by Id, and then loop over those counts. Each iteration takes that many rows off the top of the sorted dataframe, so the dataframe is split into smaller chunks that each contain a single Id.
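from pyspark.sql.functions import col
# Number of rows per Id, ordered by Id
lm = demo_df.groupBy("Id").count().orderBy(col("Id"))
count_list = lm.rdd.map(lambda x: x[1]).collect()
per_df = demo_df
for i in count_list:
    # The first i rows of the sorted remainder are the rows of the next Id
    cur_df = per_df.orderBy(col("Id")).limit(i)
    # Remove that chunk from the remainder before the next iteration
    per_df = per_df.subtract(cur_df)
    cur_df.show(truncate=False)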
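If chunks should hold several IDs up to a fixed size, rather than one ID per chunk, a greedy assignment is one option. The sketch below is plain Python and not part of the answer above; the function name assign_chunks and the sample ids are assumptions. It starts a new chunk whenever the next ID's rows would not fit, so chunks come out at most chunk_size rows long, not exactly equal.

```python
from collections import Counter

def assign_chunks(ids, chunk_size):
    """Map each ID to a chunk number so that no ID is split across chunks.

    Greedy, in sorted ID order: a new chunk is started whenever adding the
    next ID's rows would push the current chunk past `chunk_size`.
    An ID with more rows than `chunk_size` gets a chunk of its own.
    """
    counts = Counter(ids)
    chunk_of = {}
    chunk_no, filled = 0, 0
    for id_, n in sorted(counts.items()):
        if filled and filled + n > chunk_size:
            chunk_no += 1
            filled = 0
        chunk_of[id_] = chunk_no
        filled += n
    return chunk_of

# Made-up ID column: a has 2 rows, b 1, c 3, d 1, e 2.
ids = ["a", "a", "b", "c", "c", "c", "d", "e", "e"]
chunk_of = assign_chunks(ids, chunk_size=4)
```

The resulting ID-to-chunk mapping could then be joined back onto the dataframe as a new column and used to filter or partition the rows.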

Pytables duplicates 2.5 giga rows

I currently have a .h5 file with a table in it consisting of three columns: a text column of 64 chars, a UInt32 column relating to the source of the text, and a UInt32 column which is the xxhash of the text. The table consists of ~2.5e9 rows.
I am trying to find and count the duplicates of each text entry in the table, essentially merging them into one entry while counting the instances. I have tried doing so by indexing on the hash column and then looping through table.itersorted(hash), keeping track of the hash value and checking for collisions, very similar to "finding a duplicate in a hdf5 pytable with 500e6 rows". I did not modify the table as I was looping through it but rather wrote the merged entries to a new table; the code is at the bottom.
Basically, the problem is that the whole process takes far too long: it took me about 20 hours to get to iteration #5 4e5. I am working on an HDD, however, so it is entirely possible the bottleneck is there. Do you see any way I can improve my code, or can you suggest another approach? Thank you in advance for any help.
P.S. I promise I am not doing anything illegal, it is simply a large scale leaked password analysis for my Bachelor Thesis.
ref = 3  # manually checked first occurring hash, to simplify the code below
gen_cnt = 0
locs = {}
print("STARTING")
for row in table.itersorted('xhashx'):
    gen_cnt += 1  # so as not to flush after every iteration
    ps = row['password'].decode(encoding='utf-8', errors='ignore')
    if row['xhashx'] == ref:
        # Same hash as the current group: count the password and merge its sources
        if ps in locs:
            locs[ps][0] += 1
            locs[ps][1] |= row['src']
        else:
            locs[ps] = [1, row['src']]
    else:
        # Hash changed: write out every password collected for the previous hash
        for p in locs:
            fill_password(new_password, locs[p])  # simply fills in the columns, with some fairly cheap statistics procedures
            new_password.append()
        if gen_cnt > 100:
            gen_cnt = 0
            new_table.flush()
        ref = row['xhashx']
        locs = {ps: [1, row['src']]}  # start the next hash group with the current row
Your dataset is 5x larger than the referenced solution (2.5e9 vs 500e6 rows). Have you done any testing to identify where the time is spent? The table.itersorted() method may not scale linearly and might be resource intensive. (I don't have any experience with itersorted.)
Here is a process that might be faster:
1. Extract a NumPy array of the hash field (column xhashx)
2. Find the unique hash values
3. Loop through the unique hash values and extract a NumPy array of rows that match each value
4. Do your uniqueness tests against the rows in this extracted array
5. Write the unique rows to your new file
Code for this process is below.
Note: This has not been tested, so it may have small syntax or logic gaps.
import numpy as np
# Step 1: Get a NumPy array of the 'xhashx' field/column only
# (for 2.5e9 UInt32 values this array alone takes about 10 GB of memory):
hash_arr = table.read(field='xhashx')
# Step 2: Get a new array with unique values only:
hash_arr_u = np.unique(hash_arr)
# Alternatively, combine the first 2 steps in a single step:
hash_arr_u = np.unique(table.read(field='xhashx'))
# Step 3a: Loop over the unique hash values
for hash_test in hash_arr_u:
    # Step 3b: Get an array with all rows that match this unique hash value
    # (read_where looks up hash_test among the local variables)
    match_row_arr = table.read_where('xhashx==hash_test')
    # Step 4: Check for rows with unique values
    # Check the hash row count.
    # If there is only 1 row, a uniqueness test is not required
    if match_row_arr.shape[0] == 1:
        pass  # only one row, so write it to new.table
    else:
        pass  # check for unique rows, then write unique rows to new.table
##################################################
# np.unique has an option to return the hash counts;
# these can be used as a test in the loop
(hash_arr_u, hash_cnts) = np.unique(table.read(field='xhashx'), return_counts=True)
# Loop over the array of unique hash values
for cnt in range(hash_arr_u.shape[0]):
    # Get an array with all rows that match this unique hash value
    # (index in Python first: the condition string cannot index into an array)
    hash_test = hash_arr_u[cnt]
    match_row_arr = table.read_where('xhashx==hash_test')
    # Check the hash row count.
    # If there is only 1 row, a uniqueness test is not required
    if hash_cnts[cnt] == 1:
        pass  # only one row, so write it to new.table
    else:
        pass  # check for unique rows, then write unique rows to new.table
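To illustrate the counting idea on a small scale, here is a sketch that merges duplicate entries of an in-memory structured array with NumPy alone. It does not touch PyTables, and the sample rows are made up. The field names follow the question, and combining the src values with a bitwise OR mirrors the question's code.

```python
import numpy as np

# Made-up rows with the question's three fields: text, source flags, hash.
rows = np.array(
    [
        (b"hunter2", 1, 10),
        (b"letmein", 2, 11),
        (b"hunter2", 4, 10),
        (b"abc123", 1, 12),
        (b"hunter2", 1, 10),
    ],
    dtype=[("password", "S64"), ("src", "u4"), ("xhashx", "u4")],
)

# Sort by password so that duplicates become contiguous runs.
order = np.argsort(rows["password"], kind="stable")
sorted_rows = rows[order]

# Unique passwords, the start index of each run, and the run lengths.
passwords, starts, counts = np.unique(
    sorted_rows["password"], return_index=True, return_counts=True
)

# OR together the src flags inside each run of identical passwords.
merged_src = np.bitwise_or.reduceat(sorted_rows["src"], starts)
```

Each position i in passwords, counts and merged_src then describes one merged entry. For the full table this would have to run on slices that fit in memory, for example one hash value or hash range at a time.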

Replacing large dataset Multiple Conditions Loop with faster alternative in Pandas Dataframe

I'm trying to run a nested loop over a DataFrame but I'm encountering serious speed issues. Essentially, I have a list of unique values that I want to loop through, and each of them needs to be processed on four different columns. The code is shown below:
import math
import numpy as np
import pandas as pd
def get_avg_val(temp_df, col):
    # Treat zeros as missing, then average the column
    temp_df = temp_df.replace(0, np.nan)
    avg_val = temp_df[col].mean()
    return (0 if math.isnan(avg_val) else avg_val)
Final_df = pd.DataFrame(rows_list, columns=col_names)
""" Inserts extra column to identify Securities by Group type - then identifies list of unique values"""
Final_df["Group_SecCode"] = Final_df['Group'].map(str)+ "_" + Final_df['ISIN'].map(str)
unique_list = Final_df.Group_SecCode.unique().tolist()
""" The below allows for replacing missing values with averages """
col_list = ['Option Adjusted Spread','Effective Duration','Spread Duration','Effective Convexity']
for unique_val in unique_list:
    temp_df = Final_df[Final_df['Group_SecCode'] == unique_val]
    for col in col_list:
        amended_val = get_avg_val(temp_df, col)
        """ The below identifies columns where Unique code is and there is an NaN - via mask; afterwards np.where replaces the value in the cell with the amended value"""
        mask = (Final_df['Group_SecCode'] == unique_val) & (Final_df[col].isnull())
        Final_df[col] = np.where(mask, amended_val, Final_df[col])
The 'mask' section identifies the rows where both conditions are fulfilled, and np.where replaces the values in those cells with the amended value (which itself comes from a function computing an average).
Now this would normally work, but with over 400k rows and a dozen columns it is really slow. Is there any recommended way to improve on the two for loops? I believe these are the reason the code takes so long.
Thanks all!
I am not certain this is what you are looking for, but if your goal is to impute the missing values of a series with the average value of that series within a particular group, you can do it as follows:
for col in col_list:
    Final_df[col] = Final_df.groupby('Group_SecCode')[col].transform(
        lambda x: x.fillna(x.mean()))
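As a variation on the same idea, the group means for all columns can be computed in one transform("mean") call and passed to fillna, which avoids calling a Python lambda per group. This is a sketch on made-up data. The final fillna(0) is an assumption that mirrors the question's get_avg_val, which returns 0 when a group has no valid values; the question's step of treating zeros as missing is left out here.

```python
import numpy as np
import pandas as pd

# Toy frame with one of the question's columns; the values are made up.
Final_df = pd.DataFrame({
    "Group_SecCode": ["G1_X", "G1_X", "G1_X", "G2_Y", "G2_Y"],
    "Effective Duration": [2.0, np.nan, 4.0, np.nan, np.nan],
})
col_list = ["Effective Duration"]

# Per-row group mean for every column in col_list, aligned to Final_df's index.
group_means = Final_df.groupby("Group_SecCode")[col_list].transform("mean")

# Fill gaps with the group mean; groups with no valid value fall back to 0.
Final_df[col_list] = Final_df[col_list].fillna(group_means).fillna(0)
```

Here group G1_X has mean 3.0, so its gap becomes 3.0, while G2_Y has no valid values and falls back to 0.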
UPDATE - Found an alternative way to perform the amendments via a dictionary, with the task now taking 1.5 min rather than 35 min.
Code below. This approach filters the DataFrame into smaller ones, on which a series of operations is carried out. The new data is then stored in a dictionary, with the loop adding more data to it on every pass. Finally the dictionary is converted back into a DataFrame that replaces the initial one entirely with the updated dataset.
""" Creates Dataframe compatible with Factset Upload and using rows previously stored in rows_list"""
col_names = ['Group','Date','ISIN','Name','Currency','Price','Proxy Duration','Option Adjusted Spread','Effective Duration','Spread Duration','Effective Convexity']
Final_df = pd.DataFrame(rows_list, columns=col_names)
""" Inserts extra column to identify Securities by Group type - then identifies list of unique values"""
Final_df["Group_SecCode"] = Final_df['Group'].map(str)+ "_" + Final_df['ISIN'].map(str)
unique_list = Final_df.Group_SecCode.unique().tolist()
""" The below allows for replacing missing values with averages """
col_list = ['Option Adjusted Spread','Effective Duration','Spread Duration','Effective Convexity']
""" Sets up Dictionary where to store Unique Values Dataframes"""
final_dict = {}
for unique_val in unique_list:
    condition = Final_df['Group_SecCode'].isin([unique_val])
    temp_df = Final_df[condition].replace(0, np.nan)
    for col in col_list:
        """ Perform Amendments at Filtered Dataframe - by column """
        """ 1. Replace NaN values with Median for the Datapoints encountered """
        #amended_val = get_avg_val (temp_df, col) #Function previously used to compute average
        #mask = (Final_df['Group_SecCode'] == unique_val) & (Final_df[col].isnull())
        #Final_df[col] = np.where(mask, amended_val, Final_df[col])
        amended_val = 0 if math.isnan(temp_df[col].median()) else temp_df[col].median()
        mask = temp_df[col].isnull()
        temp_df[col] = np.where(mask, amended_val, temp_df[col])
        """ 2. Perform Validation Checks via Function defined on line 36 """
        temp_df = val_checks (temp_df,col)
    """ Updates Dictionary with updated data at Unique Value level """
    final_dict.update(temp_df.to_dict('index')) #Updates Dictionary with Unique value Dataframe
""" Replaces entirety of Final Dataframe including amended data """
Final_df = pd.DataFrame.from_dict(final_dict, orient='index', columns=col_names)

Creating a Custom Estimator: State Mean Estimator

I'm trying to develop a very simple initial model to predict the amount of fines a nursing home might expect to pay based on its location.
This is my class definition:
#initial model to predict the amount of fines a nursing home might expect to pay based on its location
from sklearn.base import BaseEstimator, RegressorMixin, TransformerMixin
class GroupMeanEstimator(BaseEstimator, RegressorMixin):
    #defines what a group is by using grouper
    #initialises an empty dictionary for group averages
    def __init__(self, grouper):
        self.grouper = grouper
        self.group_averages = {}
    #Any calculation I require for my predict method goes here
    #Specifically, I want to groupby the group grouper is set by
    #I want to then find out what is the mean penalty by each group
    #X is the data containing the groups
    #Y is fine_totals
    #map each state to its mean fine_tot
    def fit(self, X, y):
        #Use self.group_averages to store the average penalty by group
        Xy = X.join(y) #Joining X&y together
        state_mean_series = Xy.groupby(self.grouper)[y.name].mean() #Creating a series of state:mean penalties
        #populating a dictionary with state:mean key:value pairs
        for row in state_mean_series.iteritems():
            self.group_averages[row[0]] = row[1]
        return self
    #The amount of fine an observation is likely to receive is based on his group mean
    #Want to first populate the list with the number of observations
    #For each observation in the list, what is his group and then set the likely fine to his group mean.
    #Return the list
    def predict(self, X):
        dictionary = self.group_averages
        group = self.grouper
        list_of_predictions = [] #initialising a list to store our return values
        for row in X.itertuples(): #iterating through each row in X
            prediction = dictionary[row.STATE] #Getting the value from group_averages dict using key row.group
            list_of_predictions.append(prediction)
        return list_of_predictions
It works for this:
state_model.predict(data.sample(5))
But it breaks down when I try this:
state_model.predict(pd.DataFrame([{'STATE': 'AS'}]))
My model can't handle a state it has not seen during fitting, and I would like help fixing that.
The loop in your fit method is not the cause: on a Series, iteritems yields (index, value) pairs, so group_averages ends up holding one mean per state. Be aware, though, that iteritems was removed in pandas 2.0; on current versions use items, or use itertuples, which gives you row-wise data:
for row in pd.DataFrame(state_mean_series).itertuples(): #row format is [STATE, mean_value]
    self.group_averages[row[0]] = row[1]
The failure comes from predict: 'AS' is not a key in the dictionary, so dictionary[row.STATE] raises a KeyError. Add a fail-safe check there:
prediction = dictionary.get(row.STATE, None) # None is the default value in case 'AS' doesn't exist; you may replace it with whatever you want
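The lookup with a default can also be done for all rows at once with Series.map, which yields NaN for keys missing from the dictionary. In this sketch the training data is made up, and falling back to the overall mean for unseen states is an assumption rather than something the question asks for; any other default would work the same way.

```python
import pandas as pd

# Made-up training data with the question's STATE column and a fine_tot target.
train = pd.DataFrame({
    "STATE": ["TX", "TX", "CA", "CA"],
    "fine_tot": [100.0, 300.0, 50.0, 150.0],
})

# State -> mean fine, as built in fit().
group_averages = train.groupby("STATE")["fine_tot"].mean().to_dict()
global_mean = train["fine_tot"].mean()  # assumed fallback for unseen states

# 'AS' does not occur in the training data.
X_new = pd.DataFrame([{"STATE": "TX"}, {"STATE": "AS"}])

# map() gives NaN for unknown states; fillna() swaps in the fallback.
predictions = X_new["STATE"].map(group_averages).fillna(global_mean).tolist()
```

Inside the estimator, global_mean would be computed in fit and stored on self so that predict can use it.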
