How can I convert an age-binning feature into numerical data? - python

I have created an agebin column from the age column. I have a range of ages, but how can I convert agebin into a numerical data type? I want to check whether agebin is an important feature or not.
I tried the following code for age binning:
traindata = data.assign(age_bins = pd.cut(data.age, 4, retbins=False, include_lowest=True))
data['agebin'] = traindata['age_bins']
data['agebin'].unique()
[[16.954, 28.5], (28.5, 40], (40, 51.5], (51.5, 63]]
Categories (4, object): [[16.954, 28.5] < (28.5, 40] < (40, 51.5] < (51.5, 63]]
What I tried:
data['enc_agebin'] = data.agebin.map({[16.954, 28.5]:1,(28.5, 40]:2,(40, 51.5]:3,(51.5, 63]:4})
I tried to map each range to a number, but I am getting a syntax error. Please suggest a good technique for converting agebin, which is categorical, to numerical data.

I think you need the labels parameter in cut:
data = pd.DataFrame({'age':[10,20,40,50,44,56,12,34,56]})
data['agebin'] = pd.cut(data.age,bins=4,labels=range(1, 5), retbins=False,include_lowest=True)
print (data)
age agebin
0 10 1
1 20 1
2 40 3
3 50 4
4 44 3
5 56 4
6 12 1
7 34 3
8 56 4
Or use labels=False; then the first bin is 0 and the last is 3 (like range(4)):
data['agebin'] = pd.cut(data.age, bins=4, labels=False, retbins=False, include_lowest=True)
print (data)
age agebin
0 10 0
1 20 0
2 40 2
3 50 3
4 44 2
5 56 3
6 12 0
7 34 2
8 56 3
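If the bins have already been created as an ordered categorical, as in the question, another option is to take the category codes directly. A minimal sketch, assuming data['agebin'] holds the intervals produced by pd.cut above:
data['enc_agebin'] = data['agebin'].cat.codes + 1
# .cat.codes numbers the ordered categories 0..3; adding 1 gives 1..4, matching the map attempt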


Pandas Group By With Overlapping Bins

I want to sum up data across overlapping bins. Basically the same as the question here, but instead of the bins being (0-8 years old), (9-17 years old), (18-26 years old), (27-35 years old), and (36-44 years old), I want them to be (0-8 years old), (1-9 years old), (2-10 years old), (3-11 years old), and (4-12 years old).
Starting with a df like this:
id  awards  age
 1     100   24
 1     150   26
 1      50   54
 2     193   34
 2     209   50
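For reference, a sketch that rebuilds this sample frame (column names as shown above):
import pandas as pd
df = pd.DataFrame({'id': [1, 1, 1, 2, 2],
                   'awards': [100, 150, 50, 193, 209],
                   'age': [24, 26, 54, 34, 50]})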
I am using the code from this answer to calculate summation across non-overlapping bins.
bins = [9 * i for i in range(0, df['age'].max() // 9 + 2)]
cuts = pd.cut(df['age'], bins, right=False)
print(cuts)
0 [18, 27)
1 [18, 27)
2 [54, 63)
3 [27, 36)
4 [45, 54)
Name: age, dtype: category
Categories (7, interval[int64, left]): [[0, 9) < [9, 18) < [18, 27) < [27, 36) < [36, 45) < [45, 54) < [54, 63)]
df_out = (df.groupby(['id', cuts])
            .agg(total_awards=('awards', 'sum'))
            .reset_index(level=0)
            .reset_index(drop=True)
         )
df_out['age_interval'] = df_out.groupby('id').cumcount()
Result
print(df_out)
id total_awards age_interval
0 1 0 0
1 1 0 1
2 1 250 2
3 1 0 3
4 1 0 4
5 1 0 5
6 1 50 6
7 2 0 0
8 2 0 1
9 2 0 2
10 2 193 3
11 2 0 4
12 2 209 5
13 2 0 6
Is it possible to work off the existing code to do this with overlapping bins?
First pivot_table your data to get one row per id with the ages as columns, then reindex to get all possible ages, from 0 to at least the max of the age column (here I use the max plus the interval length). Now you can use rolling along the columns. Rename the columns to create meaningful names. Finally, stack and reset_index to get a dataframe with the expected shape.
interval = 9  # include both bounds, like 0 and 8 for the first interval
res = (
    df.pivot_table(index='id', columns='age', values='awards',
                   aggfunc=sum, fill_value=0)
      .reindex(columns=range(0, df['age'].max() + interval), fill_value=0)
      .rolling(interval, axis=1, min_periods=interval).sum()
      .rename(columns=lambda x: f'{x-interval+1}-{x} y.o.')
      .stack()
      .reset_index(name='awards')
)
With the input data provided in the question, you get:
print(res)
# id age awards
# 0 1 0-8 y.o. 0.0
# 1 1 1-9 y.o. 0.0
# ...
# 15 1 15-23 y.o. 0.0
# 16 1 16-24 y.o. 100.0
# 17 1 17-25 y.o. 100.0
# 18 1 18-26 y.o. 250.0
# 19 1 19-27 y.o. 250.0
# 20 1 20-28 y.o. 250.0
# 21 1 21-29 y.o. 250.0
# 22 1 22-30 y.o. 250.0
# 23 1 23-31 y.o. 250.0
# 24 1 24-32 y.o. 250.0
# 25 1 25-33 y.o. 150.0
# 26 1 26-34 y.o. 150.0
# 27 1 27-35 y.o. 0.0
# ...
# 45 1 45-53 y.o. 0.0
# 46 1 46-54 y.o. 50.0
# 47 1 47-55 y.o. 50.0
# 48 1 48-56 y.o. 50.0
# 49 1 49-57 y.o. 50.0
# ...
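Note that rolling(..., axis=1) is deprecated in recent pandas releases. A sketch of an equivalent workaround, assuming the same df and interval as above, is to transpose, roll along the rows, and transpose back:
wide = (df.pivot_table(index='id', columns='age', values='awards', aggfunc='sum', fill_value=0)
          .reindex(columns=range(0, df['age'].max() + interval), fill_value=0))
rolled = wide.T.rolling(interval, min_periods=interval).sum().T
# rolled matches the result of the axis=1 rolling step; the rename/stack/reset_index steps apply as before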
I think the best approach is to first compute per-age sums, and then use a rolling window to get all 9-year intervals. This only works because all your intervals have the same size; otherwise it would be much harder.
>>> totals = df.groupby('age')['awards'].sum()
>>> totals = totals.reindex(np.arange(0, df['age'].max() + 9)).fillna(0, downcast='infer')
>>> totals
0 6
1 2
2 4
3 6
4 4
..
98 0
99 0
100 0
101 0
102 0
Name: age, Length: 103, dtype: int64
>>> totals.rolling(9).sum().dropna().astype(int).rename(lambda age: f'{age-8}-{age}')
0-8 42
1-9 43
2-10 45
3-11 47
4-12 47
..
90-98 31
91-99 27
92-100 20
93-101 13
94-102 8
Name: age, Length: 95, dtype: int64
This is slightly complicated by the fact you also want to group by id, but the idea stays the same:
>>> idx = pd.MultiIndex.from_product([df['id'].unique(), np.arange(0, df['age'].max() + 9)], names=['id', 'age'])
>>> totals = df.groupby(['id', 'age']).sum().reindex(idx).fillna(0, downcast='infer')
>>> totals
awards
1 0 128
1 204
2 136
3 367
4 387
... ...
2 98 0
99 0
100 0
101 0
102 0
[206 rows x 1 columns]
>>> totals.groupby('id').rolling(9).sum().droplevel(0).dropna().astype(int).reset_index('id')
id awards
age
8 1 3112
9 1 3390
10 1 3431
11 1 3609
12 1 3820
.. .. ...
98 2 1786
99 2 1226
100 2 900
101 2 561
102 2 317
[190 rows x 2 columns]
This is the same as #Ben.T’s answer, except that we keep the Series shape while his answer pivots it to a dataframe. At any step you could .stack('age') or .unstack('age') to switch between the two formats.
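For example, a sketch of that switch: the per-(id, age) totals above can be pivoted into the wide, one-column-per-age layout with
wide = totals['awards'].unstack('age')
and wide.stack() brings it back to the long shape.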
IIUC, you can use pd.IntervalIndex with some list comprehension:
ii = pd.IntervalIndex.from_tuples(
    [
        (s, e)
        for e, s in pd.Series(np.arange(51)).rolling(9).agg(min).dropna().items()
    ]
)
df_out = pd.concat(
    [
        pd.Series(ii.contains(x["age"]) * x["awards"], index=ii)
        for i, x in df[["age", "awards"]].iterrows()
    ],
    axis=1,
).groupby(level=0).sum().T
df_out.stack()
Output:
0 (0.0, 8.0] 0
(1.0, 9.0] 0
(2.0, 10.0] 0
(3.0, 11.0] 0
(4.0, 12.0] 0
...
4 (38.0, 46.0] 0
(39.0, 47.0] 0
(40.0, 48.0] 0
(41.0, 49.0] 0
(42.0, 50.0] 209
Length: 215, dtype: int64
An old way without pd.cut, using a for loop and some masks.
import pandas as pd

max_age = df["age"].max()
interval_length = 8
values = []
for min_age in range(max_age - interval_length + 1):
    max_age = min_age + interval_length
    awards = df.query("@min_age <= age <= @max_age").loc[:, "awards"].sum()
    values.append([min_age, max_age, awards])
df_out = pd.DataFrame(values, columns=["min_age", "max_age", "awards"])
Let me know if this is what you want :)
Let df be a DataFrame:
import pandas as pd
import random

def r(b, e):
    return [random.randint(b, e) for _ in range(300)]

df = pd.DataFrame({'id': r(1, 3), 'awards': r(0, 400), 'age': r(1, 99)})
For binning by age, I would advise creating a new column since it is clearer (and faster):
df['bin'] = df['age'].apply(lambda x: x // 9)
print(df)
The number of awards per id per bin can be obtained simply with:
totals_separate = df.groupby(['id', 'bin'])['awards'].sum()
print(totals_separate)
If I understand correctly, you would like the sum for each window of size 9 rows:
totals_rolling = df.groupby(['id', 'bin'])['awards'].rolling(9, min_periods=1).sum()
print(totals_rolling)
Reference: https://pandas.pydata.org/docs/reference/api/pandas.Series.rolling.html
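As a side note (a sketch, not part of the original answer), the bin column can also be computed without apply, by integer division on the whole Series:
df['bin'] = df['age'] // 9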

Creating a data frame named after values from another data frame

I have a data frame containing three columns, where Col_1 and Col_2 contain some arbitrary data:
data = {"Height": range(1, 20, 1), "Col_1": range(2, 40, 2), "Col_2": range(3, 60, 3)}
df = pd.DataFrame(data)
Height Col_1 Col_2
0 1 2 3
1 2 4 6
2 3 6 9
3 4 8 12
4 5 10 15
5 6 12 18
6 7 14 21
7 8 16 24
8 9 18 27
9 10 20 30
10 11 22 33
11 12 24 36
12 13 26 39
13 14 28 42
14 15 30 45
15 16 32 48
16 17 34 51
17 18 36 54
18 19 38 57
and another data frame containing height values that should be used to segment the Height column of the df:
data_segments = {"Section Height" : [1, 10, 20]}
df_segments = pd.DataFrame(data_segments)
Section Height
0 1
1 10
2 20
I want to create two new data frames: df_segment_0 containing all columns of the initial df, but only the rows whose Height lies between the first two values in df_segments, and likewise df_segment_1 for the next pair of values. They should look like:
df_segment_0
Height Col_1 Col_2
0 1 2 3
1 2 4 6
2 3 6 9
3 4 8 12
4 5 10 15
5 6 12 18
6 7 14 21
7 8 16 24
8 9 18 27
df_segment_1
Height Col_1 Col_2
9 10 20 30
10 11 22 33
11 12 24 36
12 13 26 39
13 14 28 42
14 15 30 45
15 16 32 48
16 17 34 51
17 18 36 54
18 19 38 57
I tried the following code using the .loc method and added the suggestion of C Hecht to create a list of data frames:
df_segment_list = []
try:
    for index in df_segments.index:
        df_segment = df[["Height", "Col_1", "Col_2"]].loc[(df["Height"] >= df_segments["Section Height"][index]) & (df["Height"] < df_segments["Section Height"][index + 1])]
        df_segment_list.append(df_segment)
except KeyError:
    pass
The try-except is used only to ignore the error for the last entry, since there is no height for index=2. The data frames in this list can be accessed as C Hecht suggested:
df_segment_0 = df_segment_list[0]
Height Col_1 Col_2
0 1 2 3
1 2 4 6
2 3 6 9
3 4 8 12
4 5 10 15
5 6 12 18
6 7 14 21
7 8 16 24
8 9 18 27
However, I would like to automate the naming of the final data frames. I tried:
for i in range(0, len(df_segment_list)):
    name = "df_segment_" + str(i)
    name = df_segment_list[i]
I expected this code to simply automate df_segment_0 = df_segment_list[0]; instead I receive the error name 'df_segment_0' is not defined.
The reason I need separate data frames is that I will perform many subsequent operations using Col_1 and Col_2, so I need row-wise access to each one of them, for example:
df_segment_0 = df_segment_0.assign(col_3 = df_segment_0["Col_1"] / df_segment_0["Col_2"])
How do I achieve this?
EDIT 1: Clarified question with the suggestion from C Hecht.
If you want to get all entries that are smaller than the current segment height in your segmentation data frame, here you go :)
import pandas as pd
df1 = pd.DataFrame({"Height": range(1, 20, 1), "Col_1": range(2, 40, 2), "Col_2": range(3, 60, 3)})
df_segments = pd.DataFrame({"Section Height": [1, 10, 20]})
def segment_data_frame(data_frame: pd.DataFrame, segmentation_plan: pd.DataFrame):
    df = data_frame.copy()  # making a safety copy because we mutate the df !!!
    for sh in segmentation_plan["Section Height"]:  # sh is the new maximum "Height"
        df_new = df[df["Height"] < sh]  # select all entries that match the maximum "Height"
        df.drop(df_new.index, inplace=True)  # remove them from the original DataFrame
        yield df_new
# ATTENTION: segment_data_frame() will calculate each segment at runtime!
# So if you don't want to iterate over it but rather have one list to contain
# them all, you must use list(segment_data_frame(...)) or [x for x in segment_data_frame(...)]
for segment in segment_data_frame(df1, df_segments):
    print(segment)
    print()

print(list(segment_data_frame(df1, df_segments)))
If you want to execute certain steps on those segments, you can just use the generator like so:
for segment in segment_data_frame(df1, df_segments):
    do_stuff_with(segment)
If you want to keep track of and name the individual frames, you can use a dictionary, for example:
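A minimal sketch of the dictionary idea, using hypothetical keys of the form df_segment_0, df_segment_1, and so on:
segments = {f'df_segment_{i}': seg for i, seg in enumerate(segment_data_frame(df1, df_segments))}
segments['df_segment_0']  # access a segment by name instead of creating one variable per frame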
Unfortunately I don't 100% understand what you have in mind, but I hope the following helps you find the answer:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Section Height': [20, 90, 111, 232, 252, 3383, 3768, 3826, 3947, 4100], 'df_names': [f'df_section_{i}' for i in range(10)]})
df['shifted'] = df['Section Height'].shift(-1)
new_dfs = []
for index, row in df.iterrows():
    if np.isnan(row['shifted']):
        # Don't know what you want to do here
        pass
    else:
        new_df = pd.DataFrame({'heights': [i for i in range(int(row['Section Height']), int(row['shifted']))]})
        new_df.name = row['df_names']
        new_dfs.append(new_df)
The contents of new_dfs are dataframes that look like this:
heights
0 20
1 21
2 22
3 23
4 24
.. ...
65 85
66 86
67 87
68 88
69 89
[70 rows x 1 columns]
If you clarify your questions given this input, we could help you all the way, but this should hopefully point you in the right direction.
Edit: A small comment on using df.name: This is not really stable and if you do stuff like dropping a column, pickling/unpickling, etc. the name will likely be lost. But you can surely find a good solution to maintain the name depending on your needs.

Creation dataframe from several list of lists

I need to build a dataframe from 10 lists of lists. I did it manually, but it takes time. What is a better way to do it?
I tried to do it manually. It works fine (#1).
I tried code (#2) for better performance, but it returns only the last column.
#1
import pandas as pd
import numpy as np
a1T=[([7,8,9]),([10,11,12]),([13,14,15])]
a2T=[([1,2,3]),([5,0,2]),([3,4,5])]
print (a1T)
#Output[[7, 8, 9], [10, 11, 12], [13, 14, 15]]
vis1=np.array (a1T)
vis_1_1=vis1.T
tmp2=np.array (a2T)
tmp_2_1=tmp2.T
X=np.column_stack([vis_1_1, tmp_2_1])
dataset_all = pd.DataFrame({"Visab1":X[:,0], "Visab2":X[:,1], "Visab3":X[:,2], "Temp1":X[:,3], "Temp2":X[:,4], "Temp3":X[:,5]})
print (dataset_all)
Output: Visab1 Visab2 Visab3 Temp1 Temp2 Temp3
0 7 10 13 1 5 3
1 8 11 14 2 0 4
2 9 12 15 3 2 5
> Actually I have a varying number of columns in the dataframe (500-1500), which is why I need auto-generated column names. The extra index (1, 2, 3) after the names Visab_, Temp_, and so on is constant for every case. See the code below.
For better performance I tried:
#2
n=3 # This is varying parameter. The parameter affects the number of columns in the table.
m=2 # This is constant for every case. here is 2, because we have "Visab", "Temp"
mlist=('Visab', 'Temp')
nlist=[range(1, n)]
for j in range(1, n):
    for i in range(1, m):
        col = i + (j-1)*n
        dataset_all = pd.DataFrame({mlist[j]+str(i): X[:, col]})
I expect output like
Visab1 Visab2 Visab3 Temp1 Temp2 Temp3
0 7 10 13 1 5 3
1 8 11 14 2 0 4
2 9 12 15 3 2 5
but I don't get any result (only the error "expected an indented block").
Ok, so the number of columns n is the number of sublists in each list, right? You can measure that with len:
len(a1T)
#Output
3
I'll simplify the answer above so you don't need X and add automatic column-names creation:
my_lists = [a1T, a2T]
my_names = ["Visab", "Temp"]
dfs = []
for one_list, name in zip(my_lists, my_names):
    n_columns = len(one_list)
    col_names = [name + "_" + str(n) for n in range(n_columns)]
    df = pd.DataFrame(one_list).T
    df.columns = col_names
    dfs.append(df)
dataset_all = pd.concat(dfs, axis=1)
#Output
Visab_0 Visab_1 Visab_2 Temp_0 Temp_1 Temp_2
0 7 10 13 1 5 3
1 8 11 14 2 0 4
2 9 12 15 3 2 5
Now it's much clearer. So you have:
X=np.column_stack([vis_1_1, tmp_2_1])
Let's create a list with the names of the columns:
columns_names = ["Visab1","Visab2","Visab3","Temp1","Temp2","Temp3"]
Now you can directly make a dataframe like this:
dataset_all = pd.DataFrame(X,columns=columns_names)
#Output
Visab1 Visab2 Visab3 Temp1 Temp2 Temp3
0 7 10 13 1 5 3
1 8 11 14 2 0 4
2 9 12 15 3 2 5
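Since the real data has hundreds of columns, here is a sketch for generating those names programmatically instead of typing them out, assuming n columns per variable and the same two variable names as above:
n = 3
columns_names = [f'{name}{i}' for name in ('Visab', 'Temp') for i in range(1, n + 1)]
dataset_all = pd.DataFrame(X, columns=columns_names)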

In Pandas, how to operate between columns with max performance

I have the following df:
usersidid clienthostid LoginDaysSumLastMonth LoginDaysSumLast7Days LoginDaysSum
0 9 1 50 7 1728
1 3 1 43 3 1331
2 6 1 98 9 216
3 4 1 10 6 64
4 9 2 64 32 343
5 12 3 45 43 1000
6 8 3 87 76 512
7 9 3 16 3 1200
What I'm trying to do is:
For every 'clienthostid' I look for the 'usersidid' with the highest 'LoginDaysSum', and check whether there is a usersidid that has the highest LoginDaysSum in more than one clienthostid (for instance, usersidid = 9 is the highest LoginDaysSum in clienthostid 1, 2 and 3, in rows 0, 4 and 7 respectively).
In this case, I want to choose the highest LoginDaysSum (in the example it would be the row with 1728); let's call it maxRT.
I want to calculate the ratio of LoginDaysSumLast7Days between maxRT and each of the other rows (in the example, it would be the rows with index 7 and 4).
If the ratio is below 0.8 then I want to drop the row:
index 4- LoginDaysSumLast7Days_ratio = 7/32 < 0.8 //row will drop!
index 7- LoginDaysSumLast7Days_ratio = 7/3 > 0.8 //row will stay!
The same condition will also be applied to LoginDaysSumLastMonth.
So for the example the result will be:
usersidid clienthostid LoginDaysSumLastMonth LoginDaysSumLast7Days LoginDaysSum
0 9 1 50 7 1728
1 3 1 43 3 1331
2 6 1 98 9 216
3 4 1 10 6 64
5 12 3 45 43 1000
6 8 3 87 76 512
7 9 3 16 3 1200
Now here's the snag: performance is critical.
I tried to implement it using .apply, but not only could I not make it work right, it also ran way too slow :(
My code so far (forgive me if it's written terribly wrong; I only started working with SQL, Pandas and Python for the first time last week, and everything I learned is from examples I found here ^_^):
df_client_Logindayssum_pairs = df.merge(df.groupby(['clienthostid'], as_index=False, sort=False)['LoginDaysSum'].max(),df, how='inner', on=['clienthostid', 'LoginDaysSum'])
UsersWithMoreThan1client = df_client_Logindayssum_pairs.groupby(['usersidid'], as_index=False, sort=False)['LoginDaysSum'].count().rename(columns={'LoginDaysSum': 'NumOfClientsPerUesr'})
UsersWithMoreThan1client = UsersWithMoreThan1client[UsersWithMoreThan1client.NumOfClientsPerUesr >= 2]
UsersWithMoreThan1client = df_client_Logindayssum_pairs[df_client_Logindayssum_pairs.usersidid.isin(UsersWithMoreThan1Device.loc[:, 'usersidid'])].reset_index(drop=True)
UsersWithMoreThan1client = UsersWithMoreThan1client.sort_values(['clienthostid', 'LoginDaysSum'], ascending=[True, False], inplace=True)
UsersWithMoreThan1client = ttm.groupby(['clienthostid'], sort=False)['LoginDaysSumLast7Days'].apply(lambda x: x.iloc[0] / x.iloc[1]).reset_index(name='ratio')
UsersWithMoreThan1client = UsersWithMoreThan1client[UsersWithMoreThan1client.ratio > 0.8]
UsersWithMoreThan1client = ttm.groupby(['clienthostid'], sort=False)['LoginDaysSumLastMonth'].apply(lambda x: x.iloc[0] / x.iloc[1]).reset_index(name='ratio2')
UsersWithMoreThan1client = UsersWithMoreThan1client[UsersWithMoreThan1client.ratio2 > 0.8]
Would very much appreciate any suggestions on how to do it
Thank you
I believe this is what you need:
# Put the index as a regular column
data = data.reset_index()
# Find greates LoginDaysSum for each clienthostid
agg1 = data.sort_values(by='LoginDaysSum', ascending=False).groupby(['clienthostid']).first()
# Collect greates LoginDaysSum for each usersidid
agg2 = agg1.sort_values(by='LoginDaysSum', ascending=False).groupby('usersidid').first()
# Join both previous aggregations
joined = agg1.set_index('usersidid').join(agg2, rsuffix='_max')
# Compute ratios
joined['LoginDaysSumLast7Days_ratio'] = joined['LoginDaysSumLast7Days_max'] / joined['LoginDaysSumLast7Days']
joined['LoginDaysSumLastMonth_ratio'] = joined['LoginDaysSumLastMonth_max'] / joined['LoginDaysSumLastMonth']
# Select index values that do not meet the required criteria
rem_idx = joined[(joined['LoginDaysSumLast7Days_ratio'] < 0.8) | (joined['LoginDaysSumLastMonth_ratio'] < 0.8)]['index']
# Restore index and remove the selected rows
data = data.set_index('index').drop(rem_idx)
The result in data is:
usersidid clienthostid LoginDaysSumLastMonth LoginDaysSumLast7Days LoginDaysSum
index
0 9 1 50 7 1728
1 3 1 43 3 1331
2 6 1 98 9 216
3 4 1 10 6 64
5 12 3 45 43 1000
6 8 3 87 76 512
7 9 3 16 3 1200

Fill multiple nulls for categorical data

I'm wondering if there is a pythonic way to fill nulls for categorical data by randomly choosing from the distribution of unique values. Basically proportionally / randomly filling categorical nulls based on the existing distribution of the values in the column...
-- below is an example of what I'm already doing
--I'm using numbers as categories to save time, I'm not sure how to randomly input letters
import numpy as np
import pandas as pd
np.random.seed([1])
df = pd.DataFrame(np.random.normal(10, 2, 20).round().astype(object))
df.rename(columns = {0 : 'category'}, inplace = True)
df.loc[::5] = np.nan
print(df)
category
0 NaN
1 12
2 4
3 9
4 12
5 NaN
6 10
7 12
8 13
9 9
10 NaN
11 9
12 10
13 11
14 9
15 NaN
16 10
17 4
18 9
19 9
This is how I'm currently imputing the values:
df.category.value_counts()
9 6
12 3
10 3
4 2
13 1
11 1
df.category.value_counts()/16
9 0.3750
12 0.1875
10 0.1875
4 0.1250
13 0.0625
11 0.0625
# to fill categorical info based on percentage
category_fill = np.random.choice((9, 12, 10, 4, 13, 11), size = 4, p = (.375, .1875, .1875, .1250, .0625, .0625))
df.loc[df.category.isnull(), "category"] = category_fill
Final output works, just takes a while to write
df.category.value_counts()
9 9
12 4
10 3
4 2
13 1
11 1
Is there a faster way to do this or a function that would serve this purpose?
Thanks for any and all help!
You could use stats.rv_discrete:
from scipy import stats
counts = df.category.value_counts()
dist = stats.rv_discrete(values=(counts.index, counts/counts.sum()))
fill_values = dist.rvs(size=df.shape[0] - df.category.count())
df.loc[df.category.isnull(), "category"] = fill_values
EDIT: For general data (not restricted to integers) you can do:
dist = stats.rv_discrete(values=(np.arange(counts.shape[0]),
                                 counts/counts.sum()))
fill_idxs = dist.rvs(size=df.shape[0] - df.category.count())
df.loc[df.category.isnull(), "category"] = counts.iloc[fill_idxs].index.values
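An alternative sketch (not part of the original answer): sample the observed non-null values with replacement, which draws from the same empirical distribution without building the probability vector by hand:
mask = df.category.isnull()
df.loc[mask, "category"] = df.category.dropna().sample(n=mask.sum(), replace=True).to_numpy()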
