PySpark: replacing null values on subsets of rows - python

I have a PySpark dataframe containing null values that I want to replace; however, the value to replace them with differs between groups.
My data looks like this (apologies, I don't have a way to paste it as text):
For group A I want to replace the null values with -999, while for group B I want to replace the null values with 0.
Currently, I split the data into sections and then run df = df.fillna(-999) on each section.
Is there a more efficient way of doing it? In pseudo-code, I was thinking of something along the lines of df = df.where(col('group') == A).fillna(lit(-999)).where(col('group') == B).fillna(lit(0)), but of course this doesn't work.

You can use `when`:
from pyspark.sql import functions as F

# Loop over all the columns you want to fill
for col in ('Col1', 'Col2', 'Col3'):
    # Compute the conditions that decide which fill value to use
    fill_a = F.col(col).isNull() & (F.col('Group') == 'A')
    fill_b = F.col(col).isNull() & (F.col('Group') == 'B')

    # Fill the column based on the different conditions
    # using nested `when` - `otherwise`.
    #
    # Do not forget the last `otherwise` with the original
    # values in case none of the previous conditions are met
    filled_col = (
        F.when(fill_a, -999)
        .otherwise(
            F.when(fill_b, 0)
            .otherwise(F.col(col))
        )
    )

    # 'Overwrite' the original column with the filled column
    df = df.withColumn(col, filled_col)
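
For illustration, here is a minimal sketch of what the loop does on made-up data (the SparkSession setup and the Group/Col1 column names are assumptions, not part of the original answer):
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Made-up data: one value column plus the 'Group' column used above
df = spark.createDataFrame(
    [('A', None), ('A', 1.0), ('B', None), ('B', 2.0)],
    ['Group', 'Col1'],
)

fill_a = F.col('Col1').isNull() & (F.col('Group') == 'A')
fill_b = F.col('Col1').isNull() & (F.col('Group') == 'B')
df = df.withColumn('Col1', F.when(fill_a, -999).otherwise(F.when(fill_b, 0).otherwise(F.col('Col1'))))

df.show()
# The group A null becomes -999.0 and the group B null becomes 0.0;
# non-null values are left untouched.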

Another possible option is to use coalesce for each column with a "filler" column holding the replacement values:
import pyspark.sql.functions as F

for c in ['Col1', 'Col2', 'Col3']:
    df = df.withColumn(c, F.coalesce(c, F.when(F.col('group') == 'A', -999)
                                         .when(F.col('group') == 'B', 0)))
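
One detail worth noting (my observation, not part of the original answer): if a row's group is neither 'A' nor 'B', the when chain evaluates to null, so coalesce leaves the original null in place. A final .otherwise can supply a default, for example:
import pyspark.sql.functions as F

# Reusing the df and column names from above; the -1 default is hypothetical
filler = (F.when(F.col('group') == 'A', -999)
           .when(F.col('group') == 'B', 0)
           .otherwise(-1))

for c in ['Col1', 'Col2', 'Col3']:
    df = df.withColumn(c, F.coalesce(c, filler))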

Related

How to clean dataframe column filled with names using Python?

I have the following dataframe:
import pandas as pd

df = pd.DataFrame(columns=['Name'])
df['Name'] = ['Aadam','adam','AdAm','adammm','Adam.','Bethh','beth.','beht','Beeth','Beth']
I want to clean the column in order to achieve the following:
df['Name Corrected'] = ['adam','adam','adam','adam','adam','beth','beth','beth','beth','beth']
df
Cleaned names are based on the following reference table:
ref = pd.DataFrame( columns = ['Cleaned Names'])
ref['Cleaned Names'] = ['adam','beth']
I am aware of fuzzy matching but I'm not sure if that's the most efficient way of solving the problem.
You can try:
lst=['adam','beth']
out=pd.concat([df['Name'].str.contains(x,case=False).map({True:x}) for x in lst],axis=1)
df['Name corrected']=out.bfill(axis=1).iloc[:,0]
#Finally:
df['Name corrected']=df['Name corrected'].ffill()
# note: in certain conditions ffill() can give you wrong values
Explanation:
lst=['adam','beth']
# create a list of reference words
out=pd.concat([df['Name'].str.contains(x,case=False).map({True:x}) for x in lst],axis=1)
# Check whether the 'Name' column contains each word in the list, one word at a time.
# str.contains gives a boolean Series of True/False, and map() turns each True into
# the matched word and each False into NaN. Concatenating the resulting Series on
# axis=1 gives a DataFrame with one column per reference word.
df['Name corrected']=out.bfill(axis=1).iloc[:,0]
# Backward-fill along axis=1 and take the first column
#Finally:
df['Name corrected']=df['Name corrected'].ffill()
# Forward-fill the remaining missing values
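
As a side note (an alternative I'm adding, not part of the original answer), you can build one case-insensitive regex from the reference names and use str.extract, which avoids the bfill/ffill juggling. Like the str.contains approach above, it only matches names that literally contain a reference name, so true misspellings such as 'beht' or 'Beeth' still come out as NaN and would need fuzzy matching:
import re
import pandas as pd

df = pd.DataFrame({'Name': ['Aadam','adam','AdAm','adammm','Adam.','Bethh',
                            'beth.','beht','Beeth','Beth']})
ref = ['adam', 'beth']

# Build one alternation pattern, e.g. '(adam|beth)', and pull out the first
# reference name found inside each raw string (case-insensitive)
pattern = '(' + '|'.join(map(re.escape, ref)) + ')'
df['Name Corrected'] = (df['Name']
                        .str.extract(pattern, flags=re.IGNORECASE, expand=False)
                        .str.lower())
# Rows that contain no reference name at all stay NaN here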

Filtering df on entire row

I have a dataframe with lots of coded columns. I would like to filter this df where a certain code occurs in any column. I know how to filter on multiple columns, but due to the sheer number of columns, it is impractical to write out each column.
E.g. if any column contains x, keep that row.
Thanks in advance
Why don't you try using a boolean mask?
value = ...  # the code you are looking for
df = ...     # whatever ...
mask = df[df.columns[0]] == value
for col in df.columns[1:]:
    mask |= df[col] == value
df2 = df[mask]
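
As a side note (not part of the original answer), pandas can build the same mask without the explicit loop over columns:
# Keep every row where at least one column equals `value`
df2 = df[df.eq(value).any(axis=1)]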

Python Pandas imputation of Null values

I am attempting to impute null values with an offset that corresponds to the average of the row (df.loc[row,'avg']) and the average of the column (impute[col]). Is there a way to do this that would make the method parallelize with .map? Or is there a better way to iterate through the indexes containing null values?
import numpy as np
import pandas as pd

test = pd.DataFrame({'a': [None, 2, 3, 1], 'b': [2, np.nan, 4, 2],
                     'c': [3, 4, np.nan, 3], 'avg': [2.5, 3, 3.5, 2]})
test = test[['a', 'b', 'c', 'avg']]
impute = {'a': 2, 'b': 3.33, 'c': 6}

def smarterImpute(df, impute):
    df2 = df
    for col in df.columns[:-1]:
        for row in df.index:
            if pd.isnull(df.loc[row, col]):
                df2.loc[row, col] = (impute[col]
                                     + (df.loc[:, 'avg'].mean() - df.loc[row, 'avg']))
    return df2

smarterImpute(test, impute)
Notice that in your 'filling' expression:
impute[col] + (df.loc[:,'avg'].mean() - df.loc[row,'avg'])
The first term only depends on the column and the third only on the row; the second is just a constant. So we can create an imputation dataframe to look up whenever there's a value that needs to be filled:
impute_df = pd.DataFrame(impute, index = test.index).add(test.avg.mean() - test.avg, axis = 0)
Then, there's a pandas method called .combine_first() that allows you to fill the NAs in one dataframe with the values of another, which is exactly what we need. We use this, and we're done:
test.combine_first(impute_df)
With pandas, you generally want to avoid using loops, and seek to make use of vectorization.
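
Putting the pieces together on the test frame from the question, a complete sketch might look like this (reusing the names from above):
import numpy as np
import pandas as pd

test = pd.DataFrame({'a': [None, 2, 3, 1], 'b': [2, np.nan, 4, 2],
                     'c': [3, 4, np.nan, 3], 'avg': [2.5, 3, 3.5, 2]})
impute = {'a': 2, 'b': 3.33, 'c': 6}

# Column-dependent base value plus row-dependent offset, computed for every cell at once
impute_df = pd.DataFrame(impute, index=test.index).add(test.avg.mean() - test.avg, axis=0)

# Fill each NaN in `test` with the corresponding value from `impute_df`
result = test.combine_first(impute_df)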

Add values to bottom of DataFrame automatically with Pandas

I'm initializing a DataFrame:
columns = ['Thing','Time']
df_new = pd.DataFrame(columns=columns)
and then writing values to it like this:
counter = 0
for t in df.Thing.unique():
    df_temp = df[df['Thing'] == t] #filtering the df
    df_new.loc[counter,'Thing'] = t #writing the filter value to df_new
    df_new.loc[counter,'Time'] = df_temp['delta'].sum(axis=0) #summing and adding that value to df_new
    counter += 1 #increment the row index
Is there a better way to add new values to the dataframe each time without explicitly incrementing the row index with 'counter'?
If I'm interpreting this correctly, I think this can be done in one line:
newDf = df.groupby('Thing')['delta'].sum().reset_index()
By grouping by 'Thing', you have the various "t-filters" from your for-loop. We then apply a sum() to 'delta', but only within the various "t-filtered" groups. At this point, the dataframe has the various values of "t" as the indices, and the sums of the "t-filtered deltas" as a corresponding column. To get to your desired output, we then bump the "t's" into their own column via reset_index().
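
As a quick illustration on made-up data (the 'Thing' and 'delta' names follow the question; the rename to 'Time' is optional):
import pandas as pd

df = pd.DataFrame({'Thing': ['x', 'y', 'x', 'y'],
                   'delta': [1.0, 2.0, 3.0, 4.0]})

df_new = df.groupby('Thing')['delta'].sum().reset_index()
df_new = df_new.rename(columns={'delta': 'Time'})
#   Thing  Time
# 0     x   4.0
# 1     y   6.0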

Read in data and set it to the index of a DataFrame with Pandas

I want to iterate through the rows of a DataFrame and assign values to a new DataFrame. I've accomplished that task indirectly like this:
#first I read the data from df1 and assign it to df2 if something happens
counter = 0 #line1
for index,row in df1.iterrows(): #line2
    value = row['df1_col'] #line3
    value2 = row['df1_col2'] #line4
    #try unzipping a file (pseudo code)
    df2.loc[counter,'df2_col'] = value #line5
    counter += 1 #line6
    #except
    print("Error, could not unzip {}") #line7
#then I set the desired index for df2
df2 = df2.set_index(['df2_col']) #line8
Is there a way to assign the values to the index of df2 directly in line5? Sorry, my original question was unclear. I'm creating an index based on the 'something' happening.
There are a bunch of ways to do this. According to your code, all you've done is create an empty df2 dataframe with an index of values from df1.df1_col. You could do this directly:
df2 = pd.DataFrame([], df1.df1_col)
#                  ^       ^
#                  |       |
# specifies no data, yet   |
#                          defines the index
If you are concerned about having to filter df1 then you can do:
# cond is some boolean mask representing a condition to filter on.
# I'll make one up for you.
cond = df1.df1_col > 10
df2 = pd.DataFrame([], df1.loc[cond, 'df1_col'])
No need to iterate, you can do:
df2.index = df1['df1_col']
If you really want to iterate, save it to a list and set the index.
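
A minimal end-to-end sketch with made-up data, using a stand-in condition for the 'something happens' (unzip) step:
import pandas as pd

# Hypothetical stand-in for df1
df1 = pd.DataFrame({'df1_col': ['a.zip', 'b.zip', 'c.zip'],
                    'df1_col2': [0, 1, 1]})

# Made-up condition standing in for "the file unzipped successfully"
cond = df1['df1_col2'] == 1

# Build df2 with the filtered values as its index directly, no counter needed
df2 = pd.DataFrame([], index=pd.Index(df1.loc[cond, 'df1_col'], name='df2_col'))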
