Replace repetitive values in column - python

I want to add a column whose values no longer equal those of column N after N=31 has been reached, and then plot it like
plt.plot(X[N==1], FT[N==1]), plt.plot(X[new_col==63], FT[new_col==63])
The data is the following:
+-------+----+----+-------+-------+
|   X   | N  | CN | Vdiff |  FT   |
+-------+----+----+-------+-------+
|   524 |  2 |  1 |  0.0  | 0.12  |
|   534 |  2 |  1 |  0.0  | 0.134 |
|   525 |  2 |  1 |  0.0  | 0.154 |
|  ...  |    |    |       |  ...  |
|  5976 | 31 | 14 |  0.0  | 3.54  |
|  5913 | 31 | 29 |  0.1  | 3.98  |
|  5923 |  0 | 29 |  0.0  | 3.87  |
|  ...  |    |    |       |  ...  |
| 33001 |  7 | 36 |  0.0  | 7.36  |
| 33029 |  7 | 36 |  0.0  | 8.99  |
| 33023 |  7 | 43 |  0.1  | 12.45 |
| 33114 |  0 | 43 |  0.0  | 14.33 |
+-------+----+----+-------+-------+
The solution I want is
+-------+----+----+---------+-------+
|   X   | N  | CN | new_col |  FT   |
+-------+----+----+---------+-------+
|   524 |  2 |  1 |    2    | 0.12  |
|   534 |  2 |  1 |    2    | 0.134 |
|   525 |  2 |  1 |    2    | 0.154 |
|  ...  |    |    |         |  ...  |
|  5976 | 31 | 14 |   31    | 3.54  |
|  5913 | 31 | 29 |   31    | 3.98  |
|  5923 |  0 | 29 |   32    | 3.87  |
|  ...  |    |    |         |  ...  |
| 33001 |  7 | 36 |   45    | 7.36  |
| 33029 |  7 | 36 |   45    | 8.99  |
| 33023 |  7 | 43 |   45    | 12.45 |
| 33114 |  0 | 43 |   46    | 14.33 |
+-------+----+----+---------+-------+
Note that the values in new_col should also repeat like the values in N; they should not change with every new row.

Is this the output you need? We cannot simply group by N, since it has repetitive, non-adjacent values and we need to preserve the order. Instead, we count the rows where N changes compared to its own previous value.
import pandas as pd
from io import StringIO
df = pd.read_csv(StringIO(
"""X|N|CN|Vdiff|FT
524|2|1|0.0|0.12
534|2|1|0.0|0.134
525|2|1|0.0|0.154
5976|31|14|0.0|3.54
5913|31|29|0.1|3.98
5923|0|29|0.0|3.87
33001|7|36|0.0|7.36
33029|7|36|0.0|8.99
33023|7|43|0.1|12.45
33114|0|43|0.0|14.33"""), sep="|")
# works in pandas 1.2
#>>> df["new_val"] = df.eval("C = N.shift().bfill() != N")["C"].astype(int).cumsum()
# works in older pandas
>>> df["new_val"] = (df.N.shift().bfill() != df.N).astype(int).cumsum()
>>> df
X N CN Vdiff FT new_val
0 524 2 1 0.0 0.120 0
1 534 2 1 0.0 0.134 0
2 525 2 1 0.0 0.154 0
3 5976 31 14 0.0 3.540 1
4 5913 31 29 0.1 3.980 1
5 5923 0 29 0.0 3.870 2
6 33001 7 36 0.0 7.360 3
7 33029 7 36 0.0 8.990 3
8 33023 7 43 0.1 12.450 3
9 33114 0 43 0.0 14.330 4
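To plot against the new block counter afterwards, a minimal sketch (it uses the new_val column created above together with matplotlib, mirrors the plt.plot calls from the question, and picks block number 3 simply because it occurs in this small sample):
import matplotlib.pyplot as plt

# FT vs X for the rows where the original N equals 1,
# and for the rows belonging to one block of new_val
plt.plot(df.loc[df["N"] == 1, "X"], df.loc[df["N"] == 1, "FT"], label="N == 1")
plt.plot(df.loc[df["new_val"] == 3, "X"], df.loc[df["new_val"] == 3, "FT"], label="new_val == 3")
plt.legend()
plt.show()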

Related

Efficient way to apply a pandas conditional to a large list of values [duplicate]

I have a dataframe df_groups that contains the sample number, group number and accuracy.
Table 1: Samples with their groups
+----+----------+------------+------------+
| | sample | group | Accuracy |
|----+----------+------------+------------|
| 0 | 0 | 6 | 91.6 |
| 1 | 1 | 4 | 92.9333 |
| 2 | 2 | 2 | 91 |
| 3 | 3 | 2 | 90.0667 |
| 4 | 4 | 4 | 91.8 |
| 5 | 5 | 5 | 92.5667 |
| 6 | 6 | 6 | 91.1 |
| 7 | 7 | 5 | 92.3333 |
| 8 | 8 | 2 | 92.7667 |
| 9 | 9 | 0 | 91.1333 |
| 10 | 10 | 4 | 92.5 |
| 11 | 11 | 5 | 92.4 |
| 12 | 12 | 7 | 93.1333 |
| 13 | 13 | 7 | 93.5333 |
| 14 | 14 | 2 | 92.1 |
| 15 | 15 | 6 | 93.2 |
| 16 | 16 | 8 | 92.7333 |
| 17 | 17 | 8 | 90.8 |
| 18 | 18 | 3 | 91.9 |
| 19 | 19 | 3 | 93.3 |
| 20 | 20 | 5 | 90.6333 |
| 21 | 21 | 9 | 92.9333 |
| 22 | 22 | 4 | 93.3333 |
| 23 | 23 | 9 | 91.5333 |
| 24 | 24 | 9 | 92.9333 |
| 25 | 25 | 1 | 92.3 |
| 26 | 26 | 9 | 92.2333 |
| 27 | 27 | 6 | 91.9333 |
| 28 | 28 | 5 | 92.1 |
| 29 | 29 | 8 | 84.8 |
+----+----------+------------+------------+
I want to return a dataframe containing only the rows whose accuracy is above a given value (e.g. 92),
so the result will look like this:
Table 1: Samples with their groups when accuracy is above 92.
+----+----------+------------+------------+
| | sample | group | Accuracy |
|----+----------+------------+------------|
| 1 | 1 | 4 | 92.9333 |
| 2 | 5 | 5 | 92.5667 |
| 3 | 7 | 5 | 92.3333 |
| 4 | 8 | 2 | 92.7667 |
| 5 | 10 | 4 | 92.5 |
| 6 | 11 | 5 | 92.4 |
| 7 | 12 | 7 | 93.1333 |
| 8 | 13 | 7 | 93.5333 |
| 9 | 14 | 2 | 92.1 |
| 10 | 15 | 6 | 93.2 |
| 11 | 16 | 8 | 92.7333 |
| 12 | 19 | 3 | 93.3 |
| 13 | 21 | 9 | 92.9333 |
| 14 | 22 | 4 | 93.3333 |
| 15 | 24 | 9 | 92.9333 |
| 16 | 25 | 1 | 92.3 |
| 17 | 26 | 9 | 92.2333 |
| 18 | 28 | 5 | 92.1 |
+----+----------+------------+------------+
So, the result should be returned based on the condition of being greater than or equal to a predefined accuracy (e.g. 92, 90, 85, etc.).
You can use df.loc[df['Accuracy'] >= predefined_accuracy].
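For completeness, a minimal self-contained sketch of that filter (column names as in the question, with only the first few rows of data; the threshold is an ordinary variable):
import pandas as pd

df_groups = pd.DataFrame({
    "sample":   [0, 1, 2, 3, 4, 5],
    "group":    [6, 4, 2, 2, 4, 5],
    "Accuracy": [91.6, 92.9333, 91.0, 90.0667, 91.8, 92.5667],
})

predefined_accuracy = 92
# a boolean mask keeps only the rows at or above the threshold
result = df_groups.loc[df_groups["Accuracy"] >= predefined_accuracy]
print(result)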

Pandas sum column for each column pair

I have a dataframe as follows. I'm attempting to sum the values in the Total column, for each date, for each unique pair from columns P_buy and P_sell.
+--------+----------+-------+---------+--------+----------+-----------------+
| Index | Date | Type | Quantity| P_buy | P_sell | Total |
+--------+----------+-------+---------+--------+----------+-----------------+
| 0 | 1/1/2020 | 1 | 10 | 1 | 1 | 10 |
| 1 | 1/1/2020 | 1 | 10 | 2 | 1 | 20 |
| 2 | 1/1/2020 | 2 | 20 | 3 | 1 | 25 |
| 3 | 1/1/2020 | 2 | 20 | 4 | 1 | 20 |
| 4 | 2/1/2020 | 3 | 30 | 1 | 1 | 35 |
| 5 | 2/1/2020 | 3 | 30 | 2 | 1 | 30 |
| 6 | 2/1/2020 | 1 | 40 | 3 | 1 | 45 |
| 7 | 2/1/2020 | 1 | 40 | 4 | 1 | 40 |
| 8 | 3/1/2020 | 2 | 50 | 1 | 1 | 55 |
| 9 | 3/1/2020 | 2 | 50 | 2 | 1 | 53 |
+--------+----------+-------+---------+--------+----------+-----------------+
My desired output would be as follows, where for each unique P_buy/P_sell pair I receive the sum of the Total column:
+--------+--------+-------+
| P_buy  | P_sell | Total |
+--------+--------+-------+
| 1      | 1      | 100   |
| 2      | 1      | 103   |
| 3      | 1      | 70    |
+--------+--------+-------+
My attempts have used the groupby function, but I haven't been able to implement it successfully.
# use a groupby on the desired columns and sum the total
df.groupby(['P_buy','P_sell'], as_index=False)['Total'].sum()
P_buy P_sell Total
0 1 1 100
1 2 1 103
2 3 1 70
3 4 1 60
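If the totals are also needed per date (the question mentions "for each date", although the desired output collapses the dates), a possible variant is to add Date to the grouping keys:
# per-date totals for each P_buy/P_sell pair
df.groupby(['Date', 'P_buy', 'P_sell'], as_index=False)['Total'].sum()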

shift below cells to count for R

I am using the code below to produce the following result in Python, and I want an equivalent of this code in R.
Here N is a column of the dataframe data. The CN column is calculated from the values of column N with a specific pattern, and it gives me the following result in Python.
+---+----+
| N | CN |
+---+----+
| 0 | 0 |
| 1 | 1 |
| 1 | 1 |
| 2 | 2 |
| 2 | 2 |
| 0 | 3 |
| 0 | 3 |
| 1 | 4 |
| 1 | 4 |
| 1 | 4 |
| 2 | 5 |
| 2 | 5 |
| 3 | 6 |
| 4 | 7 |
| 0 | 8 |
| 1 | 9 |
| 2 | 10 |
+---+----+
A short overview of my code is:
data = pd.read_table(filename,skiprows=15,decimal=',', sep='\t',header=None,names=["Date ","Heure ","temps (s) ","X","Z"," LVDT V(mm) " ,"Force normale (N) ","FT","FN(N) ","TS"," NS(kPa) ","V (mm/min)","Vitesse normale (mm/min)","e (kPa)","k (kPa/mm) " ,"N " ,"Nb cycles normal" ,"Cycles " ,"Etat normal" ,"k imposÈ (kPa/mm)"])
data.columns = [col.strip() for col in data.columns.tolist()]
N = data[data.keys()[15]]
N = np.array(N)
data["CN"] = (data.N.shift().bfill() != data.N).astype(int).cumsum()
An example of data.head() is shown here:
+-------+-------------+------------+-----------+----------+----------+------------+-------------------+-----------+-------------+-----------+------------+------------+--------------------------+------------+------------+-----+------------------+--------+-------------+-------------------+----+
| Index | Date | Heure | temps (s) | X | Z(mm) | LVDT V(mm) | Force normale (N) | FT | FN(N) | FT (kPa) | NS(kPa) | V (mm/min) | Vitesse normale (mm/min) | e (kPa) | k (kPa/mm) | N | Nb cycles normal | Cycles | Etat normal | k imposÈ (kPa/mm) | CN |
+-------+-------------+------------+-----------+----------+----------+------------+-------------------+-----------+-------------+-----------+------------+------------+--------------------------+------------+------------+-----+------------------+--------+-------------+-------------------+----+
| 184 | 01/02/2022 | 12:36:52 | 402.163 | 6.910243 | 1.204797 | 0.001101 | 299.783665 | 31.494351 | 1428.988908 | 11.188704 | 505.825016 | 0.1 | 2.0 | 512.438828 | 50.918786 | 0.0 | 0.0 | Sort | Monte | 0.0 | 0 |
| 185 | 01/02/2022 | 12:36:54 | 404.288 | 6.907822 | 1.205647 | 4.9e-05 | 296.072718 | 31.162313 | 1404.195316 | 11.028167 | 494.97955 | 0.1 | -2.0 | 500.084986 | 49.685639 | 0.0 | 0.0 | Sort | Descend | 0.0 | 0 |
| 186 | 01/02/2022 | 12:36:56 | 406.536 | 6.907906 | 1.204194 | -0.000214 | 300.231424 | 31.586401 | 1429.123486 | 11.21895 | 505.750815 | 0.1 | 2.0 | 512.370164 | 50.914002 | 0.0 | 0.0 | Sort | Monte | 0.0 | 0 |
| 187 | 01/02/2022 | 12:36:58 | 408.627 | 6.910751 | 1.204293 | -0.000608 | 300.188686 | 31.754064 | 1428.979519 | 11.244542 | 505.624564 | 0.1 | 2.0 | 512.309254 | 50.906544 | 0.0 | 0.0 | Sort | Monte | 0.0 | 0 |
| 188 | 01/02/2022 | 12:37:00 | 410.679 | 6.907805 | 1.205854 | -0.000181 | 296.358074 | 31.563389 | 1415.224427 | 11.129375 | 502.464948 | 0.1 | 2.0 | 510.702313 | 50.742104 | 0.0 | 0.0 | Sort | Monte | 0.0 | 0 |
+-------+-------------+------------+-----------+----------+----------+------------+-------------------+-----------+-------------+-----------+------------+------------+--------------------------+------------+------------+-----+------------------+--------+-------------+-------------------+----+
A one-line cumsum trick solves it.
cumsum(c(0L, diff(df1$N) != 0))
#> [1] 0 1 1 2 2 3 3 4 4 4 5 5 6 7 8 9 10
all.equal(
cumsum(c(0L, diff(df1$N) != 0)),
df1$CN
)
#> [1] TRUE
Created on 2022-02-14 by the reprex package (v2.0.1)
Data
x <- "
+---+----+
| N | CN |
+---+----+
| 0 | 0 |
| 1 | 1 |
| 1 | 1 |
| 2 | 2 |
| 2 | 2 |
| 0 | 3 |
| 0 | 3 |
| 1 | 4 |
| 1 | 4 |
| 1 | 4 |
| 2 | 5 |
| 2 | 5 |
| 3 | 6 |
| 4 | 7 |
| 0 | 8 |
| 1 | 9 |
| 2 | 10 |
+---+----+"
df1 <- read.table(textConnection(x), header = TRUE, sep = "|", comment.char = "+")[2:3]
Created on 2022-02-14 by the reprex package (v2.0.1)

Hot Deck Imputation in Python

I have been trying to find Python code that would allow me to replace missing values in a dataframe's column. The focus of my analysis is in biostatistics so I am not comfortable with replacing values using means/medians/modes. I would like to apply the "Hot Deck Imputation" method.
I cannot find any Python functions or packages online that takes the column of a dataframe and fills missing values with the "Hot Deck Imputation" method.
I did, however, see this GitHub project and did not find it useful.
The following is an example of some of my data (assume this is a pandas dataframe):
| age | sex | bmi | anesthesia score | pain level |
|-----|-----|------|------------------|------------|
| 78 | 1 | 40.7 | 3 | 0 |
| 55 | 1 | 25.3 | 3 | 0 |
| 52 | 0 | 25.4 | 3 | 0 |
| 77 | 1 | 44.9 | 3 | 3 |
| 71 | 1 | 26.3 | 3 | 0 |
| 39 | 0 | 28.2 | 2 | 0 |
| 82 | 1 | 27 | 2 | 1 |
| 70 | 1 | 37.9 | 3 | 0 |
| 71 | 1 | NA | 3 | 1 |
| 53 | 0 | 24.5 | 2 | NA |
| 68 | 0 | 34.7 | 3 | 0 |
| 57 | 0 | 30.7 | 2 | 0 |
| 40 | 1 | 22.4 | 2 | 0 |
| 73 | 1 | 34.2 | 2 | 0 |
| 66 | 1 | NA | 3 | 1 |
| 55 | 1 | 42.6 | NA | NA |
| 53 | 0 | 37.5 | 3 | 3 |
| 65 | 0 | 31.6 | 2 | 2 |
| 36 | 0 | 29.6 | 1 | 0 |
| 60 | 0 | 25.7 | 2 | NA |
| 70 | 1 | 30 | NA | NA |
| 66 | 1 | 28.3 | 2 | 0 |
| 63 | 1 | 29.4 | 3 | 2 |
| 70 | 1 | 36 | 3 | 2 |
I would like to apply a Python function that would allow me to input a column as a parameter and return the column with the missing values replaced with imputed values using the "Hot Deck Imputation" method.
I am using this for the purpose of statistical modeling with models such as linear and logistic regression using Statsmodels.api. I am not using this for Machine Learning.
Any help would be much appreciated!
You can use ffill, which applies last observation carried forward (LOCF) hot deck imputation.
# forward-fill: propagate the last observed value into the missing entries
df.fillna(method='ffill', inplace=True)
# in newer pandas versions, where the method= argument is deprecated, use:
# df.ffill(inplace=True)
Scikit-learn's impute module offers KNN, mean, median, most-frequent and other imputing methods. (https://scikit-learn.org/stable/modules/impute.html)
# sklearn '>=0.22.x'
from sklearn.impute import KNNImputer
imputer = KNNImputer(n_neighbors=2, weights="uniform")
DF['imputed_x'] = imputer.fit_transform(DF[['bmi']])
print(DF['imputed_x'])
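If you want an actual hot deck imputation rather than LOCF or KNN, here is a minimal sketch of a simple random hot deck: each missing value is replaced by a value drawn at random from the observed (non-missing) donors in the same column. The function name hot_deck_impute and the fixed random seed are illustrative choices, not an existing library API.
import numpy as np
import pandas as pd

def hot_deck_impute(series, random_state=None):
    """Replace NaNs in a Series with random draws from its observed values."""
    rng = np.random.default_rng(random_state)
    out = series.copy()
    donors = series.dropna().to_numpy()
    missing = out.isna()
    # sample one donor value (with replacement) for every missing entry
    out.loc[missing] = rng.choice(donors, size=missing.sum())
    return out

# usage, assuming df is the dataframe shown above
df["bmi"] = hot_deck_impute(df["bmi"], random_state=0)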

Reattach ID column to data after passing through sklearn model

I am working on trying to create a predictive regression model that forecasts the completion date of a number of orders.
My dataset looks like:
| ORDER_NUMBER | Feature1 | Feature2 | Feature3 | Feature4 | Feature5 | Feature6 | TOTAL_DAYS_TO_COMPLETE | Feature8 | Feature9 | Feature10 | Feature11 | Feature12 | Feature13 | Feature14 | Feature15 | Feature16 | Feature17 | Feature18 | Feature19 | Feature20 | Feature21 | Feature22 | Feature23 | Feature24 | Feature25 | Feature26 | Feature27 | Feature28 | Feature29 | Feature30 | Feature31 |
|:------------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:----------------------:|:--------:|:--------:|:---------:|:---------:|:---------:|:---------:|:---------:|:---------:|:---------:|:---------:|:---------:|:---------:|:---------:|:---------:|:---------:|:---------:|:---------:|:---------:|:---------:|:---------:|:---------:|:---------:|:---------:|:---------:|
| 102203591 | 12 | 2014 | 10 | 2014 | 1 | 2015 | 760 | 50 | 83 | 5 | 6 | 12 | 18 | 31 | 8 | 0 | 1 | 0 | 1 | 16 | 131.29 | 24.3768 | 158.82 | 1.13 | 6.52 | 10 | 51 | 39 | 27 | 88 | 1084938 |
| 102231010 | 2 | 2015 | 1 | 2015 | 2 | 2015 | 706 | 35 | 34 | 2 | 1 | 4 | 3 | 3 | 3 | 0 | 0 | 0 | 1 | 2 | 11.95 | 5.162 | 17.83 | 1.14 | 3.45 | 1 | 4 | 20 | 16 | 25 | 367140 |
| 102251893 | 6 | 2015 | 4 | 2015 | 3 | 2015 | 1143 | 36 | 43 | 1 | 2 | 4 | 5 | 6 | 3 | 1 | 0 | 0 | 1 | 5 | 8.55 | 5.653 | 34.51 | 4.59 | 6.1 | 0 | 1 | 17 | 30 | 12 | 103906 |
| 102287793 | 4 | 2015 | 2 | 2015 | 4 | 2015 | 733 | 45 | 71 | 4 | 1 | 6 | 35 | 727 | 6 | 0 | 3 | 15 | 0 | 19 | 174.69 | 97.448 | 319.98 | 1.49 | 3.28 | 20 | 113 | 71 | 59 | 71 | 1005041 |
| 102288060 | 6 | 2015 | 5 | 2015 | 4 | 2015 | 1092 | 26 | 21 | 1 | 1 | 3 | 2 | 2 | 1 | 0 | 0 | 0 | 0 | 2 | 4.73 | 4.5363 | 18.85 | 3.11 | 4.16 | 0 | 1 | 16 | 8 | 16 | 69062 |
| 102308069 | 8 | 2015 | 6 | 2015 | 5 | 2015 | 676 | 41 | 34 | 2 | 0 | 3 | 2 | 2 | 1 | 0 | 0 | 0 | 0 | 2 | 2.98 | 6.1173 | 11.3 | 1.36 | 1.85 | 0 | 1 | 17 | 12 | 3 | 145887 |
| 102319918 | 8 | 2015 | 7 | 2015 | 6 | 2015 | 884 | 25 | 37 | 1 | 1 | 3 | 2 | 3 | 2 | 0 | 0 | 1 | 0 | 2 | 5.57 | 3.7083 | 9.18 | 0.97 | 2.48 | 0 | 1 | 14 | 5 | 7 | 45243 |
| 102327578 | 6 | 2015 | 4 | 2015 | 6 | 2015 | 595 | 49 | 68 | 3 | 5 | 9 | 11 | 13 | 5 | 4 | 2 | 0 | 1 | 10 | 55.41 | 24.3768 | 104.98 | 2.03 | 4.31 | 10 | 51 | 39 | 26 | 40 | 418266 |
| 102337989 | 7 | 2015 | 5 | 2015 | 7 | 2015 | 799 | 50 | 66 | 5 | 6 | 12 | 21 | 29 | 12 | 0 | 0 | 0 | 1 | 20 | 138.79 | 24.3768 | 172.56 | 1.39 | 7.08 | 10 | 51 | 39 | 34 | 101 | 1229299 |
| 102450069 | 8 | 2015 | 7 | 2015 | 11 | 2015 | 456 | 20 | 120 | 2 | 1 | 3 | 12 | 14 | 8 | 0 | 0 | 0 | 0 | 7 | 2.92 | 6.561 | 12.3 | 1.43 | 1.87 | 2 | 1 | 15 | 6 | 6 | 142805 |
| 102514564 | 5 | 2016 | 3 | 2016 | 2 | 2016 | 639 | 25 | 35 | 1 | 2 | 4 | 3 | 6 | 3 | 0 | 0 | 0 | 0 | 3 | 4.83 | 4.648 | 14.22 | 2.02 | 3.06 | 0 | 1 | 15 | 5 | 13 | 62941 |
| 102528121 | 10 | 2015 | 9 | 2015 | 3 | 2016 | 413 | 15 | 166 | 1 | 1 | 3 | 2 | 3 | 2 | 0 | 0 | 0 | 0 | 2 | 4.23 | 1.333 | 15.78 | 8.66 | 11.84 | 1 | 4 | 8 | 6 | 3 | 111752 |
| 102564376 | 1 | 2016 | 12 | 2015 | 4 | 2016 | 802 | 27 | 123 | 2 | 1 | 4 | 3 | 3 | 3 | 0 | 1 | 0 | 0 | 3 | 1.27 | 2.063 | 6.9 | 2.73 | 3.34 | 1 | 4 | 14 | 20 | 6 | 132403 |
| 102564472 | 1 | 2016 | 12 | 2015 | 4 | 2016 | 817 | 27 | 123 | 0 | 1 | 2 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1.03 | 2.063 | 9.86 | 4.28 | 4.78 | 1 | 4 | 14 | 22 | 4 | 116907 |
| 102599569 | 2 | 2016 | 12 | 2015 | 5 | 2016 | 425 | 47 | 151 | 1 | 2 | 4 | 3 | 4 | 3 | 0 | 0 | 0 | 0 | 2 | 27.73 | 15.8993 | 60.5 | 2.06 | 3.81 | 12 | 108 | 34 | 24 | 20 | 119743 |
| 102599628 | 2 | 2016 | 12 | 2015 | 5 | 2016 | 425 | 47 | 151 | 3 | 4 | 8 | 8 | 9 | 7 | 0 | 0 | 0 | 2 | 8 | 39.28 | 14.8593 | 91.26 | 3.5 | 6.14 | 12 | 108 | 34 | 38 | 15 | 173001 |
| 102606421 | 3 | 2016 | 12 | 2015 | 5 | 2016 | 965 | 55 | 161 | 5 | 11 | 17 | 29 | 44 | 11 | 1 | 1 | 0 | 1 | 22 | 148.06 | 23.7983 | 195.69 | 2 | 8.22 | 10 | 51 | 39 | 47 | 112 | 1196097 |
| 102621293 | 7 | 2016 | 5 | 2016 | 6 | 2016 | 701 | 42 | 27 | 2 | 1 | 4 | 3 | 3 | 1 | 0 | 0 | 0 | 1 | 2 | 8.39 | 3.7455 | 13.93 | 1.48 | 3.72 | 1 | 5 | 14 | 14 | 20 | 258629 |
| 102632364 | 7 | 2016 | 6 | 2016 | 6 | 2016 | 982 | 41 | 26 | 4 | 2 | 7 | 6 | 6 | 2 | 0 | 0 | 0 | 1 | 4 | 26.07 | 2.818 | 37.12 | 3.92 | 13.17 | 1 | 5 | 14 | 22 | 10 | 167768 |
| 102643207 | 9 | 2016 | 9 | 2016 | 7 | 2016 | 255 | 9 | 73 | 3 | 1 | 5 | 4 | 4 | 2 | 0 | 0 | 0 | 0 | 0 | 2.17 | 0.188 | 4.98 | 14.95 | 26.49 | 1 | 4 | 2 | 11 | 1 | 49070 |
| 102656091 | 9 | 2016 | 8 | 2016 | 7 | 2016 | 356 | 21 | 35 | 1 | 0 | 2 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1.45 | 2.0398 | 5.54 | 2.01 | 2.72 | 1 | 4 | 14 | 15 | 3 | 117107 |
| 102660407 | 9 | 2016 | 8 | 2016 | 7 | 2016 | 462 | 21 | 31 | 2 | 0 | 3 | 2 | 2 | 1 | 0 | 0 | 0 | 0 | 2 | 3.18 | 2.063 | 8.76 | 2.7 | 4.25 | 1 | 4 | 14 | 14 | 10 | 151272 |
| 102665666 | 10 | 2016 | 9 | 2016 | 7 | 2016 | 235 | 9 | 64 | 0 | 1 | 2 | 1 | 2 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0.188 | 2.95 | 10.37 | 15.69 | 1 | 4 | 2 | 10 | 1 | 52578 |
| 102665667 | 10 | 2016 | 9 | 2016 | 7 | 2016 | 235 | 9 | 64 | 0 | 1 | 2 | 1 | 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0.72 | 0.188 | 2.22 | 7.98 | 11.81 | 1 | 4 | 2 | 10 | 1 | 52578 |
| 102665668 | 10 | 2016 | 9 | 2016 | 7 | 2016 | 235 | 9 | 64 | 0 | 1 | 2 | 1 | 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0.9 | 0.188 | 2.24 | 7.13 | 11.91 | 1 | 4 | 2 | 10 | 1 | 52578 |
| 102666306 | 7 | 2016 | 6 | 2016 | 7 | 2016 | 235 | 16 | 34 | 3 | 1 | 5 | 5 | 6 | 4 | 0 | 0 | 0 | 0 | 3 | 14.06 | 3.3235 | 31.27 | 5.18 | 9.41 | 1 | 1 | 16 | 5 | 18 | 246030 |
| 102668177 | 8 | 2016 | 6 | 2016 | 8 | 2016 | 233 | 36 | 32 | 0 | 1 | 2 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 2.5 | 5.2043 | 8.46 | 1.15 | 1.63 | 0 | 1 | 14 | 2 | 4 | 89059 |
| 102669909 | 6 | 2016 | 4 | 2016 | 8 | 2016 | 244 | 46 | 105 | 4 | 11 | 16 | 28 | 30 | 15 | 1 | 2 | 1 | 1 | 25 | 95.49 | 26.541 | 146.89 | 1.94 | 5.53 | 1 | 51 | 33 | 9 | 48 | 78488 |
| 102670188 | 5 | 2016 | 4 | 2016 | 8 | 2016 | 413 | 20 | 109 | 1 | 1 | 2 | 2 | 3 | 2 | 0 | 0 | 0 | 0 | 1 | 2.36 | 6.338 | 8.25 | 0.93 | 1.3 | 2 | 1 | 14 | 5 | 3 | 117137 |
| 102671063 | 8 | 2016 | 6 | 2016 | 8 | 2016 | 296 | 46 | 44 | 2 | 4 | 7 | 7 | 111 | 3 | 1 | 0 | 1 | 0 | 7 | 12.96 | 98.748 | 146.24 | 1.35 | 1.48 | 20 | 113 | 70 | 26 | 9 | 430192 |
| 102672475 | 8 | 2016 | 7 | 2016 | 8 | 2016 | 217 | 20 | 23 | 0 | 1 | 2 | 1 | 2 | 1 | 0 | 0 | 0 | 0 | 1 | 0.5 | 4.9093 | 5.37 | 0.99 | 1.09 | 0 | 1 | 16 | 0 | 1 | 116673 |
| 102672477 | 10 | 2016 | 9 | 2016 | 8 | 2016 | 194 | 20 | 36 | 1 | 0 | 2 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0.61 | 5.1425 | 3.65 | 0.59 | 0.71 | 0 | 1 | 16 | 0 | 2 | 98750 |
| 102672513 | 10 | 2016 | 9 | 2016 | 8 | 2016 | 228 | 20 | 36 | 1 | 1 | 3 | 2 | 2 | 1 | 0 | 0 | 0 | 0 | 1 | 0.25 | 5.1425 | 6.48 | 1.21 | 1.26 | 0 | 1 | 16 | 0 | 2 | 116780 |
| 102682943 | 5 | 2016 | 4 | 2016 | 8 | 2016 | 417 | 20 | 113 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0.64 | 6.338 | 5.53 | 0.77 | 0.87 | 2 | 1 | 14 | 5 | 2 | 100307 |
The ORDER_NUMBER should not be a feature in the model -- it is a unique identifier that I would like to essentially not count in the model since it is just a random ID, but include in the final dataset, so I can tie back predictions and actual values to the order.
Currently, my code looks like this:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
import pandas as pd
import numpy as np
def get_feature_importances(cols, importances):
feats = {}
for feature, importance in zip(cols, importances):
feats[feature] = importance
importances = pd.DataFrame.from_dict(feats, orient='index').rename(columns={0: 'Gini-importance'})
return importances.sort_values(by='Gini-importance', ascending = False)
def compare_values(arr1, arr2):
thediff = 0
thediffs = []
for thing1, thing2 in zip(arr1, arr2):
thediff = abs(thing1 - thing2)
thediffs.append(thediff)
return thediffs
def print_to_file(filepath, arr):
with open(filepath, 'w') as f:
for item in arr:
f.write("%s\n" % item)
# READ IN THE DATA TABLE ABOVE
data = pd.read_csv('test.csv')
# create the labels, or field we are trying to estimate
label = data['TOTAL_DAYS_TO_COMPLETE']
# remove the header
label = label[1:]
# create the data, or the data that is to be estimated
data = data.drop('TOTAL_DAYS_TO_COMPLETE', axis=1)
# Remove the order number since we don't need it
data = data.drop('ORDER_NUMBER', axis=1)
# remove the header
data = data[1:]
# # split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(data, label, test_size = 0.2)
rf = RandomForestRegressor(
bootstrap = True,
max_depth = None,
max_features = 'sqrt',
min_samples_leaf = 1,
min_samples_split = 2,
n_estimators = 5000
)
rf.fit(X_train, y_train)
rf_predictions = rf.predict(X_test)
rf_differences = compare_values(y_test, rf_predictions)
rf_Avg = np.average(rf_differences)
print("#################################################")
print("DATA FOR RANDOM FORESTS")
print(rf_Avg)
importances = get_feature_importances(X_test.columns, rf.feature_importances_)
print()
print(importances)
If I print(y_test) and print(rf_predictions) I get something like:
**print(y_test)**
7
155
84
64
49
41
200
168
43
111
64
46
96
47
50
27
216
..
**print(rf_predictions)**
34.496
77.366
69.6105
61.6825
80.8495
79.8785
177.5465
129.014
70.0405
97.3975
82.4435
57.9575
108.018
57.5515
..
And it works. If I print out y_test and rf_predictions, I get the labels for the test data and the predicted label values.
However, I would like to see which orders are associated with both the y_test values and the rf_predictions values. How can I keep that information and create a dataframe like the one below:
| Order Number | Predicted Value | Actual Value |
|--------------|-------------------|--------------|
| Foo0 | 34.496 | 7 |
| Foo1 | 77.366 | 155 |
| Foo2 | 69.6105 | 84 |
| Foo3 | 61.6825 | 64 |
I have tried looking at this post but could not get a solution from it. I did try print(y_test, rf_predictions), but that did not help since I had already dropped the ORDER_NUMBER field with .drop().
As you're using pandas dataframes, the index is retained in all your X/y train/test datasets, so you can re-assemble everything after applying the model. We just need to save the order numbers before dropping that column: order_numbers = data['ORDER_NUMBER']. The predictions rf_predictions are returned in the same order as the input to rf.predict(X_test), i.e. rf_predictions[i] belongs to X_test.iloc[i].
This creates your required result dataset:
res = y_test.to_frame('Actual Value')
res.insert(0, 'Predicted Value', rf_predictions)
res = order_numbers.to_frame().join(res, how='inner')
Btw, data = data[1:] doesn't remove the header, it removes the first row, so there's no need to remove anything when you work with pandas dataframes.
So the final program will be:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
import pandas as pd
import numpy as np
def get_feature_importances(cols, importances):
feats = {}
for feature, importance in zip(cols, importances):
feats[feature] = importance
importances = pd.DataFrame.from_dict(feats, orient='index').rename(columns={0: 'Gini-importance'})
return importances.sort_values(by='Gini-importance', ascending = False)
def compare_values(arr1, arr2):
thediff = 0
thediffs = []
for thing1, thing2 in zip(arr1, arr2):
thediff = abs(thing1 - thing2)
thediffs.append(thediff)
return thediffs
def print_to_file(filepath, arr):
with open(filepath, 'w') as f:
for item in arr:
f.write("%s\n" % item)
# READ IN THE DATA TABLE ABOVE
data = pd.read_csv('test.csv')
# create the labels, or field we are trying to estimate
label = data['TOTAL_DAYS_TO_COMPLETE']
# create the data, or the data that is to be estimated
data = data.drop('TOTAL_DAYS_TO_COMPLETE', axis=1)
# save the order numbers, then remove the column from the features
order_numbers = data['ORDER_NUMBER']
data = data.drop('ORDER_NUMBER', axis=1)
# split into training and testing sets (random_state=1 reproduces the output below)
X_train, X_test, y_train, y_test = train_test_split(data, label, test_size = 0.2, random_state = 1)
rf = RandomForestRegressor(
bootstrap = True,
max_depth = None,
max_features = 'sqrt',
min_samples_leaf = 1,
min_samples_split = 2,
n_estimators = 5000
)
rf.fit(X_train, y_train)
rf_predictions = rf.predict(X_test)
rf_differences = compare_values(y_test, rf_predictions)
rf_Avg = np.average(rf_differences)
print("#################################################")
print("DATA FOR RANDOM FORESTS")
print(rf_Avg)
importances = get_feature_importances(X_test.columns, rf.feature_importances_)
print()
print(importances)
res = y_test.to_frame('Actual Value')
res.insert(0, 'Predicted Value', rf_predictions)
res = order_numbers.to_frame().join(res, how='inner')
print(res)
With your example data from above we get (for train_test_split with random_state=1):
ORDER_NUMBER Predicted Value Actual Value
3 102287793 652.0746 733
14 102599569 650.3984 425
19 102643207 319.4964 255
20 102656091 388.6004 356
26 102668177 475.1724 233
27 102669909 671.9158 244
32 102672513 319.1550 228
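An equivalent way to attach the order numbers (a sketch, not part of the original answer) is to index the saved order_numbers Series with X_test.index, since train_test_split keeps the original dataframe index on every split:
res = pd.DataFrame({
    'ORDER_NUMBER': order_numbers.loc[X_test.index],
    'Predicted Value': rf_predictions,
    'Actual Value': y_test,
})
print(res)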
