Hot Deck Imputation in Python

I have been trying to find Python code that would allow me to replace missing values in a dataframe's column. The focus of my analysis is biostatistics, so I am not comfortable replacing values with means/medians/modes. I would like to apply the "Hot Deck Imputation" method.
I cannot find any Python functions or packages online that take the column of a dataframe and fill missing values with the "Hot Deck Imputation" method.
I did see this GitHub project, but did not find it useful.
The following is an example of some of my data (assume this is a pandas dataframe):
| age | sex | bmi | anesthesia score | pain level |
|-----|-----|------|------------------|------------|
| 78 | 1 | 40.7 | 3 | 0 |
| 55 | 1 | 25.3 | 3 | 0 |
| 52 | 0 | 25.4 | 3 | 0 |
| 77 | 1 | 44.9 | 3 | 3 |
| 71 | 1 | 26.3 | 3 | 0 |
| 39 | 0 | 28.2 | 2 | 0 |
| 82 | 1 | 27 | 2 | 1 |
| 70 | 1 | 37.9 | 3 | 0 |
| 71 | 1 | NA | 3 | 1 |
| 53 | 0 | 24.5 | 2 | NA |
| 68 | 0 | 34.7 | 3 | 0 |
| 57 | 0 | 30.7 | 2 | 0 |
| 40 | 1 | 22.4 | 2 | 0 |
| 73 | 1 | 34.2 | 2 | 0 |
| 66 | 1 | NA | 3 | 1 |
| 55 | 1 | 42.6 | NA | NA |
| 53 | 0 | 37.5 | 3 | 3 |
| 65 | 0 | 31.6 | 2 | 2 |
| 36 | 0 | 29.6 | 1 | 0 |
| 60 | 0 | 25.7 | 2 | NA |
| 70 | 1 | 30 | NA | NA |
| 66 | 1 | 28.3 | 2 | 0 |
| 63 | 1 | 29.4 | 3 | 2 |
| 70 | 1 | 36 | 3 | 2 |
I would like a Python function that takes a column as a parameter and returns the column with the missing values replaced by imputed values using the "Hot Deck Imputation" method.
I am using this for statistical modeling, such as linear and logistic regression with statsmodels.api; I am not using it for machine learning.
Any help would be much appreciated!

You can use ffill, which performs last observation carried forward (LOCF), a simple sequential form of hot deck imputation.
# forward-fill: carry the last observed value down into each NaN
df.fillna(method='ffill', inplace=True)  # in pandas >= 2.0, prefer df.ffill()
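Note that a forward fill leaves any NaN at the top of a column untouched; chaining a backward fill afterwards covers that case (a general pandas pattern, shown here on the bmi column):
# fill remaining leading NaNs from the next observed value
df['bmi'] = df['bmi'].ffill().bfill()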
Scikit-learn's impute module offers KNN, mean, median, and other imputation methods (https://scikit-learn.org/stable/modules/impute.html).
# sklearn >= 0.22
from sklearn.impute import KNNImputer

# KNN measures distance between rows on the other columns, so fit on
# several columns; fitting on bmi alone leaves nothing to match neighbors with
cols = ['age', 'sex', 'bmi', 'anesthesia score', 'pain level']
imputer = KNNImputer(n_neighbors=2, weights="uniform")
df['imputed_bmi'] = imputer.fit_transform(df[cols])[:, cols.index('bmi')]
print(df['imputed_bmi'])
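If you want an actual hot deck rather than LOCF or KNN, a minimal sketch of a simple random hot deck is below: each missing entry is filled by sampling, with replacement, from the observed donor values in the same column. The function name hot_deck_impute and the seeded random generator are my own choices, not from any package:
import numpy as np
import pandas as pd

def hot_deck_impute(col, seed=None):
    # simple random hot deck: draw each replacement from the observed donors
    rng = np.random.default_rng(seed)
    col = col.copy()
    donors = col.dropna().to_numpy()
    mask = col.isna()
    col.loc[mask] = rng.choice(donors, size=mask.sum())
    return col

df['bmi'] = hot_deck_impute(df['bmi'], seed=0)
A fancier variant would first group rows by covariates such as age and sex so donors come from similar records, but the random draw above is the core of the method.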

Related

How can I generate a financial summary using pandas dataframes?

I'd like to create a table from a dataframe with subtotals per business, totals per business type, and columns summing multiple value columns. The long-term goal is a selection tool that compares month summaries from whichever month's Excel sheet I ingest (e.g. did minerals item 26 from BA3 disappear the next month?), but I believe that is best saved for another question.
For now, I am having trouble figuring out how to summarize the data.
I have a dataframe in Pandas that contains the following:
Business | Business Type | ID | Value-Q1 | Value-Q2 | Value-Q3 | Value-Q4 | Value-FY |
---------+---------------+----+----------+----------+----------+----------+----------+
BA1 | Widgets | 1 | 7 | 0 | 0 | 8 | 15 |
BA1 | Widgets | 2 | 7 | 0 | 0 | 8 | 15 |
BA1 | Cups | 3 | 9 | 10 | 0 | 0 | 19 |
BA1 | Cups | 4 | 9 | 10 | 0 | 0 | 19 |
BA1 | Cups | 5 | 9 | 10 | 0 | 0 | 19 |
BA1 | Snorkels | 6 | 0 | 0 | 8 | 8 | 16 |
BA1 | Snorkels | 7 | 0 | 0 | 8 | 8 | 16 |
BA1 | Snorkels | 8 | 0 | 0 | 8 | 8 | 16 |
BA2 | Widgets | 9 | 100 | 0 | 7 | 0 | 107 |
BA2 | Widgets | 10 | 100 | 0 | 7 | 0 | 107 |
BA2 | Widgets | 11 | 100 | 0 | 7 | 0 | 107 |
BA2 | Widgets | 12 | 100 | 0 | 7 | 0 | 107 |
BA2 | Bread | 13 | 0 | 0 | 0 | 1 | 1 |
BA2 | Bread | 14 | 0 | 0 | 0 | 1 | 1 |
BA2 | Bread | 15 | 0 | 0 | 0 | 1 | 1 |
BA2 | Bread | 16 | 0 | 0 | 0 | 1 | 1 |
BA2 | Cat Food | 17 | 504 | 0 | 0 | 500 | 1004 |
BA2 | Cat Food | 18 | 504 | 0 | 0 | 500 | 1004 |
BA2 | Cat Food | 19 | 504 | 0 | 0 | 500 | 1004 |
BA2 | Cat Food | 20 | 504 | 0 | 0 | 500 | 1004 |
BA2 | Cat Food | 21 | 504 | 0 | 0 | 500 | 1004 |
BA3 | Gravel | 22 | 7 | 7 | 7 | 7 | 28 |
BA3 | Gravel | 23 | 7 | 7 | 7 | 7 | 28 |
BA3 | Gravel | 24 | 7 | 7 | 7 | 7 | 28 |
BA3 | Rocks | 25 | 3 | 2 | 0 | 0 | 5 |
BA3 | Minerals | 26 | 1 | 1 | 0 | 1 | 3 |
BA3 | Minerals | 27 | 1 | 1 | 0 | 1 | 3 |
BA4 | Widgets | 28 | 6 | 4 | 0 | 0 | 10 |
BA4 | Widgets | 29 | 6 | 4 | 0 | 0 | 10 |
BA4 | Widgets | 30 | 6 | 4 | 0 | 0 | 10 |
BA4 | Widgets | 31 | 6 | 4 | 0 | 0 | 10 |
BA4 | Widgets | 32 | 6 | 4 | 0 | 0 | 10 |
BA4 | Something | 33 | 1000 | 0 | 0 | 2 | 1002 |
BA5 | Bonbons | 34 | 60 | 40 | 10 | 0 | 110 |
BA5 | Bonbons | 35 | 60 | 40 | 10 | 0 | 110 |
BA5 | Gummy Bears | 36 | 7 | 0 | 0 | 9 | 16 |
(Imagine each ID has different values as well)
My goal is to slice the data to get the total occurrences of each business type (e.g. BA1 has 2 widgets, 3 cups, and 3 snorkels, each with a unique ID) as well as the total values:
           | Occurrence | Q1 Sum | Q2 Sum | Q3 Sum | Q4 Sum | FY Sum |
BA 1       |     8      |   41   |   30   |   24   |   40   |  135   |
  Widgets  |     2      |   14   |    0   |    0   |   16   |   30   |
  Cups     |     3      |   27   |   30   |    0   |    0   |   57   |
  Snorkels |     3      |    0   |    0   |   24   |   24   |   48   |
BA 2       | (subtotal of the BA2 items below)
  Widgets  | (repeat the above)
  Bread    | (repeat the above)
  Cat Food | (repeat the above)
I have more columns that mirror the Q1-FY columns for other fields (e.g. Value 2 Q1-FY) that I would like to include in the summary, but I imagine I could repeat whatever process is used for the current Value columns.
I have a list of unique Businesses
businesses = ['BA1', 'BA2', 'BA3', 'BA4', 'BA5']
and a list of unique Business Types
business_types = ['Widgets', 'Cups', 'Snorkels', 'Bread', 'Cat Food', 'Gravel', 'Rocks', 'Minerals', 'Something', 'Bonbons', 'Gummy Bears']
and finally a list of the Values
values = ['Value-Q1', 'Value-Q2', 'Value-Q3', 'Value-Q4', 'Value-FY']
and I tried doing a for loop over the lists.
Maybe I need to put the dataframe values on their own individual lines? I tried the following for at least the FY sum:
for b in businesses:
    for bt in business_types:
        df_sums = df.loc[(df['Business'] == b) & (df['Business Type'] == bt), 'Value-FY'].sum()
but it didn't quite give me what I was hoping for.
I'm sure there's a better way to grab the values. I managed to get FY totals per business into a dictionary, but I couldn't get totals per business per business type (the business types are also unique to each business).
If anyone has any advice or can point me in the right direction I'd really appreciate it!
You should try the groupby method for this. groupby allows for several grouping options; I have attached a link to its documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html
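For this question, a sketch of what that could look like, assuming the column names from the table above and using the ID column for the occurrence count:
import pandas as pd

value_cols = ['Value-Q1', 'Value-Q2', 'Value-Q3', 'Value-Q4', 'Value-FY']

# occurrences and value sums per business / business type
grp = df.groupby(['Business', 'Business Type'])
summary = pd.concat([grp['ID'].count().rename('Occurrence'),
                     grp[value_cols].sum()], axis=1)

# subtotal rows per business (the "BA 1" lines in the desired output)
grp_b = df.groupby('Business')
subtotals = pd.concat([grp_b['ID'].count().rename('Occurrence'),
                       grp_b[value_cols].sum()], axis=1)

print(summary)
print(subtotals)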

Pandas sum column for each column pair

I have a dataframe as follows. I'm attempting to sum the values in the Total column, for each date, for each unique pair from columns P_buy and P_sell.
+--------+----------+-------+---------+--------+----------+-----------------+
| Index | Date | Type | Quantity| P_buy | P_sell | Total |
+--------+----------+-------+---------+--------+----------+-----------------+
| 0 | 1/1/2020 | 1 | 10 | 1 | 1 | 10 |
| 1 | 1/1/2020 | 1 | 10 | 2 | 1 | 20 |
| 2 | 1/1/2020 | 2 | 20 | 3 | 1 | 25 |
| 3 | 1/1/2020 | 2 | 20 | 4 | 1 | 20 |
| 4 | 2/1/2020 | 3 | 30 | 1 | 1 | 35 |
| 5 | 2/1/2020 | 3 | 30 | 2 | 1 | 30 |
| 6 | 2/1/2020 | 1 | 40 | 3 | 1 | 45 |
| 7 | 2/1/2020 | 1 | 40 | 4 | 1 | 40 |
| 8 | 3/1/2020 | 2 | 50 | 1 | 1 | 55 |
| 9 | 3/1/2020 | 2 | 50 | 2 | 1 | 53 |
+--------+----------+-------+---------+--------+----------+-----------------+
My desired output would be as follows: for each unique P_buy/P_sell pair, the sum of Total across the dates.
+--------+----------+-------+
| P_buy  | P_sell   | Total |
+--------+----------+-------+
|      1 |        1 |   100 |
|      2 |        1 |   103 |
|      3 |        1 |    70 |
+--------+----------+-------+
My attempts have used the groupby function, but I haven't been able to implement it successfully.
# use a groupby on the desired columns and sum the total
df.groupby(['P_buy','P_sell'], as_index=False)['Total'].sum()
P_buy P_sell Total
0 1 1 100
1 2 1 103
2 3 1 70
3 4 1 60
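If you also need the per-date breakdown mentioned in the question, adding Date to the grouping keys is a natural extension of the same idea:
# sum Total per date for each P_buy/P_sell pair
df.groupby(['Date', 'P_buy', 'P_sell'], as_index=False)['Total'].sum()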

Replace repetitive values in column

I want to add a column that keeps increasing past the point where N resets (after N reaches 31), and then plot it like
plt.plot(X[N==1], FT[N==1]), plt.plot(X[new_col==63], FT[new_col==63])
The data is as follows:
+-------+----+----+-------+-------+
|   X   | N  | CN | Vdiff |  FT   |
+-------+----+----+-------+-------+
|   524 |  2 |  1 |  0.0  |  0.12 |
|   534 |  2 |  1 |  0.0  | 0.134 |
|   525 |  2 |  1 |  0.0  | 0.154 |
|  ...  |    |    |       |  ...  |
|  5976 | 31 | 14 |  0.0  |  3.54 |
|  5913 | 31 | 29 |  0.1  |  3.98 |
|  5923 |  0 | 29 |  0.0  |  3.87 |
|  ...  |    |    |       |  ...  |
| 33001 |  7 | 36 |  0.0  |  7.36 |
| 33029 |  7 | 36 |  0.0  |  8.99 |
| 33023 |  7 | 43 |  0.1  | 12.45 |
| 33114 |  0 | 43 |  0.0  | 14.33 |
+-------+----+----+-------+-------+
The solution I want is
+-------+----+----+---------+-------+
|   X   | N  | CN | new_col |  FT   |
+-------+----+----+---------+-------+
|   524 |  2 |  1 |    2    |  0.12 |
|   534 |  2 |  1 |    2    | 0.134 |
|   525 |  2 |  1 |    2    | 0.154 |
|  ...  |    |    |         |  ...  |
|  5976 | 31 | 14 |   31    |  3.54 |
|  5913 | 31 | 29 |   31    |  3.98 |
|  5923 |  0 | 29 |   32    |  3.87 |
|  ...  |    |    |         |  ...  |
| 33001 |  7 | 36 |   45    |  7.36 |
| 33029 |  7 | 36 |   45    |  8.99 |
| 33023 |  7 | 43 |   45    | 12.45 |
| 33114 |  0 | 43 |   46    | 14.33 |
+-------+----+----+---------+-------+
Note that the values in new_col should repeat just like the values in N; they should not change on every new row.
Is this the output you need? We cannot simply group by N, because it has repeating, non-adjacent values and we need to preserve the row order. Instead, we count the places where N changes relative to its previous value.
import pandas as pd
from io import StringIO

df = pd.read_csv(StringIO(
"""X|N|CN|Vdiff|FT
524|2|1|0.0|0.12
534|2|1|0.0|0.134
525|2|1|0.0|0.154
5976|31|14|0.0|3.54
5913|31|29|0.1|3.98
5923|0|29|0.0|3.87
33001|7|36|0.0|7.36
33029|7|36|0.0|8.99
33023|7|43|0.1|12.45
33114|0|43|0.0|14.33"""), sep="|")

# pandas >= 1.2: the same logic inside eval
# df["new_val"] = df.eval("C = N.shift().bfill() != N")["C"].astype(int).cumsum()

# works in older pandas as well: flag rows where N differs from the
# previous row, then cumulative-sum the flags to label each run
df["new_val"] = (df.N.shift().bfill() != df.N).astype(int).cumsum()
print(df)
X N CN Vdiff FT new_val
0 524 2 1 0.0 0.120 0
1 534 2 1 0.0 0.134 0
2 525 2 1 0.0 0.154 0
3 5976 31 14 0.0 3.540 1
4 5913 31 29 0.1 3.980 1
5 5923 0 29 0.0 3.870 2
6 33001 7 36 0.0 7.360 3
7 33029 7 36 0.0 8.990 3
8 33023 7 43 0.1 12.450 3
9 33114 0 43 0.0 14.330 4
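With the runs labelled, selecting one run for the plot in the question is a filter on the new column (a sketch; the label 3 is just one of the run ids produced above):
import matplotlib.pyplot as plt

run = df[df["new_val"] == 3]  # one contiguous run of N
plt.plot(run["X"], run["FT"])
plt.show()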

Pandas merge two dataframe and drop extra rows

How can I merge/join these two dataframes ONLY on "sample_id" and drop the extra rows from the second dataframe when merging/joining?
Using pandas in Python.
First dataframe (fdf)
| sample_id | name |
|-----------|-------|
| 1 | Mark |
| 1 | Dart |
| 2 | Julia |
| 2 | Oolia |
| 2 | Talia |
Second dataframe (sdf)
| sample_id | salary | time |
|-----------|--------|------|
| 1 | 20 | 0 |
| 1 | 30 | 5 |
| 1 | 40 | 10 |
| 1 | 50 | 15 |
| 2 | 33 | 0 |
| 2 | 23 | 5 |
| 2 | 24 | 10 |
| 2 | 28 | 15 |
| 2 | 29 | 20 |
So the resulting df would be:
| sample_id | name | salary | time |
|-----------|-------|--------|------|
| 1 | Mark | 20 | 0 |
| 1 | Dart | 30 | 5 |
| 2 | Julia | 33 | 0 |
| 2 | Oolia | 23 | 5 |
| 2 | Talia | 24 | 10 |
There are duplicates, so you need a helper column for a correct DataFrame.merge; GroupBy.cumcount provides a within-group counter that pairs the rows positionally:
df = (fdf.assign(g=fdf.groupby('sample_id').cumcount())
         .merge(sdf.assign(g=sdf.groupby('sample_id').cumcount()),
                on=['sample_id', 'g'])
         .drop('g', axis=1))
print(df)
sample_id name salary time
0 1 Mark 20 0
1 1 Dart 30 5
2 2 Julia 33 0
3 2 Oolia 23 5
4 2 Talia 24 10
# alternative: left-merge on sample_id, then keep the earliest time per (sample_id, name)
final_res = pd.merge(fdf, sdf, on=['sample_id'], how='left')
final_res.sort_values(['sample_id', 'name', 'time'], ascending=True, inplace=True)
final_res.drop_duplicates(subset=['sample_id', 'name'], keep='first', inplace=True)

Pandas dataframe tweak

I have some data as follows:
+-----+-------+-------+--------------------+
| Sys | Event | Code | Duration |
+-----+-------+-------+--------------------+
| | 1 | 65 | 355.52 |
| | 1 | 66 | 18.78 |
| | 1 | 66 | 223.42 |
| | 1 | 66 | 392.17 |
| | 2 | 66 | 449.03 |
| | 2 | 66 | 506.03 |
| | 2 | 66 | 73.93 |
| | 3 | 66 | 123.17 |
| | 3 | 66 | 97.85 |
+-----+-------+-------+--------------------+
Now, for each Code, I want to sum the Durations for all Event = 1 and so on, regardless of Sys. How do I approach this?
As DYZ says:
df.groupby(['Code', 'Event']).Duration.sum()
Output:
Code Event
65 1 355.52
66 1 634.37
2 1028.99
3 221.02
Name: Duration, dtype: float64
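If a wide table is preferred (one row per Code, one column per Event), unstacking the same result is a small extension:
# pivot the Event level into columns; absent combinations become 0
df.groupby(['Code', 'Event']).Duration.sum().unstack(fill_value=0)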
