Replace repetitive values in column - python

I want to add a column whose values no longer equal those of column N after N=31 has been reached, and then plot it like
plt.plot(X[N==1], FT[N==1]), plt.plot(X[new_col==63], FT[new_col==63])
The data is the following:
+-------+----+----+-------+-------+
|   X   | N  | CN | Vdiff |  FT   |
+-------+----+----+-------+-------+
|   524 |  2 |  1 |  0.0  | 0.12  |
|   534 |  2 |  1 |  0.0  | 0.134 |
|   525 |  2 |  1 |  0.0  | 0.154 |
|  ...  |    |    |       |  ...  |
|  5976 | 31 | 14 |  0.0  | 3.54  |
|  5913 | 31 | 29 |  0.1  | 3.98  |
|  5923 |  0 | 29 |  0.0  | 3.87  |
|  ...  |    |    |       |  ...  |
| 33001 |  7 | 36 |  0.0  | 7.36  |
| 33029 |  7 | 36 |  0.0  | 8.99  |
| 33023 |  7 | 43 |  0.1  | 12.45 |
| 33114 |  0 | 43 |  0.0  | 14.33 |
+-------+----+----+-------+-------+
The solution I want is
+-------+----+----+---------+-------+
|   X   | N  | CN | new_col |  FT   |
+-------+----+----+---------+-------+
|   524 |  2 |  1 |    2    | 0.12  |
|   534 |  2 |  1 |    2    | 0.134 |
|   525 |  2 |  1 |    2    | 0.154 |
|  ...  |    |    |         |  ...  |
|  5976 | 31 | 14 |   31    | 3.54  |
|  5913 | 31 | 29 |   31    | 3.98  |
|  5923 |  0 | 29 |   32    | 3.87  |
|  ...  |    |    |         |  ...  |
| 33001 |  7 | 36 |   45    | 7.36  |
| 33029 |  7 | 36 |   45    | 8.99  |
| 33023 |  7 | 43 |   45    | 12.45 |
| 33114 |  0 | 43 |   46    | 14.33 |
+-------+----+----+---------+-------+
Note that the values in new_col should also repeat like the values in N; they should not change with every new row.

Is this the output you need? We cannot simply group by N, since it has repetitive, non-adjacent values and we need to preserve the order. Instead, we count the rows where N changes compared to its own previous value.
import pandas as pd
from io import StringIO
df = pd.read_csv(StringIO(
"""X|N|CN|Vdiff|FT
524|2|1|0.0|0.12
534|2|1|0.0|0.134
525|2|1|0.0|0.154
5976|31|14|0.0|3.54
5913|31|29|0.1|3.98
5923|0|29|0.0|3.87
33001|7|36|0.0|7.36
33029|7|36|0.0|8.99
33023|7|43|0.1|12.45
33114|0|43|0.0|14.33"""), sep="|")
# works in pandas 1.2
#>>> df["new_val"] = df.eval("C = N.shift().bfill() != N")["C"].astype(int).cumsum()
# works in older pandas
>>> df["new_val"] = (df.N.shift().bfill() != df.N).astype(int).cumsum()
>>> df
X N CN Vdiff FT new_val
0 524 2 1 0.0 0.120 0
1 534 2 1 0.0 0.134 0
2 525 2 1 0.0 0.154 0
3 5976 31 14 0.0 3.540 1
4 5913 31 29 0.1 3.980 1
5 5923 0 29 0.0 3.870 2
6 33001 7 36 0.0 7.360 3
7 33029 7 36 0.0 8.990 3
8 33023 7 43 0.1 12.450 3
9 33114 0 43 0.0 14.330 4
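To plot against the new block counter afterwards, a minimal sketch (it uses the new_val column created above together with matplotlib, mirrors the plt.plot calls from the question, and picks block number 3 simply because it occurs in this small sample):
import matplotlib.pyplot as plt

# FT vs X for the rows where the original N equals 1,
# and for the rows belonging to one block of new_val
plt.plot(df.loc[df["N"] == 1, "X"], df.loc[df["N"] == 1, "FT"], label="N == 1")
plt.plot(df.loc[df["new_val"] == 3, "X"], df.loc[df["new_val"] == 3, "FT"], label="new_val == 3")
plt.legend()
plt.show()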

Related

Efficient way to apply a pandas conditional to a large list of values [duplicate]

I have a dataframe df_groups that contains the sample number, group number and accuracy.
Table 1: Samples with their groups
+----+----------+------------+------------+
| | sample | group | Accuracy |
|----+----------+------------+------------|
| 0 | 0 | 6 | 91.6 |
| 1 | 1 | 4 | 92.9333 |
| 2 | 2 | 2 | 91 |
| 3 | 3 | 2 | 90.0667 |
| 4 | 4 | 4 | 91.8 |
| 5 | 5 | 5 | 92.5667 |
| 6 | 6 | 6 | 91.1 |
| 7 | 7 | 5 | 92.3333 |
| 8 | 8 | 2 | 92.7667 |
| 9 | 9 | 0 | 91.1333 |
| 10 | 10 | 4 | 92.5 |
| 11 | 11 | 5 | 92.4 |
| 12 | 12 | 7 | 93.1333 |
| 13 | 13 | 7 | 93.5333 |
| 14 | 14 | 2 | 92.1 |
| 15 | 15 | 6 | 93.2 |
| 16 | 16 | 8 | 92.7333 |
| 17 | 17 | 8 | 90.8 |
| 18 | 18 | 3 | 91.9 |
| 19 | 19 | 3 | 93.3 |
| 20 | 20 | 5 | 90.6333 |
| 21 | 21 | 9 | 92.9333 |
| 22 | 22 | 4 | 93.3333 |
| 23 | 23 | 9 | 91.5333 |
| 24 | 24 | 9 | 92.9333 |
| 25 | 25 | 1 | 92.3 |
| 26 | 26 | 9 | 92.2333 |
| 27 | 27 | 6 | 91.9333 |
| 28 | 28 | 5 | 92.1 |
| 29 | 29 | 8 | 84.8 |
+----+----------+------------+------------+
I want to return a dataframe containing only the rows whose accuracy is above a given value (e.g. 92),
so the result will look like this:
Table 1: Samples with their groups when accuracy is above 92.
+----+----------+------------+------------+
| | sample | group | Accuracy |
|----+----------+------------+------------|
| 1 | 1 | 4 | 92.9333 |
| 2 | 5 | 5 | 92.5667 |
| 3 | 7 | 5 | 92.3333 |
| 4 | 8 | 2 | 92.7667 |
| 5 | 10 | 4 | 92.5 |
| 6 | 11 | 5 | 92.4 |
| 7 | 12 | 7 | 93.1333 |
| 8 | 13 | 7 | 93.5333 |
| 9 | 14 | 2 | 92.1 |
| 10 | 15 | 6 | 93.2 |
| 11 | 16 | 8 | 92.7333 |
| 12 | 19 | 3 | 93.3 |
| 13 | 21 | 9 | 92.9333 |
| 14 | 22 | 4 | 93.3333 |
| 15 | 24 | 9 | 92.9333 |
| 16 | 25 | 1 | 92.3 |
| 17 | 26 | 9 | 92.2333 |
| 18 | 28 | 5 | 92.1 |
+----+----------+------------+------------+
So, the result should be returned based on the condition of being greater than or equal to a predefined accuracy (e.g. 92, 90, 85, etc.).
You can use df.loc[df['Accuracy'] >= predefined_accuracy].
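For completeness, a minimal self-contained sketch of that filter (column names as in the question, with only the first few rows of data; the threshold is an ordinary variable):
import pandas as pd

df_groups = pd.DataFrame({
    "sample":   [0, 1, 2, 3, 4, 5],
    "group":    [6, 4, 2, 2, 4, 5],
    "Accuracy": [91.6, 92.9333, 91.0, 90.0667, 91.8, 92.5667],
})

predefined_accuracy = 92
# a boolean mask keeps only the rows at or above the threshold
result = df_groups.loc[df_groups["Accuracy"] >= predefined_accuracy]
print(result)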

Pandas sum column for each column pair

I have a dataframe as follows. I'm attempting to sum the values in the Total column, for each date, for each unique pair from columns P_buy and P_sell.
+--------+----------+-------+---------+--------+----------+-----------------+
| Index | Date | Type | Quantity| P_buy | P_sell | Total |
+--------+----------+-------+---------+--------+----------+-----------------+
| 0 | 1/1/2020 | 1 | 10 | 1 | 1 | 10 |
| 1 | 1/1/2020 | 1 | 10 | 2 | 1 | 20 |
| 2 | 1/1/2020 | 2 | 20 | 3 | 1 | 25 |
| 3 | 1/1/2020 | 2 | 20 | 4 | 1 | 20 |
| 4 | 2/1/2020 | 3 | 30 | 1 | 1 | 35 |
| 5 | 2/1/2020 | 3 | 30 | 2 | 1 | 30 |
| 6 | 2/1/2020 | 1 | 40 | 3 | 1 | 45 |
| 7 | 2/1/2020 | 1 | 40 | 4 | 1 | 40 |
| 8 | 3/1/2020 | 2 | 50 | 1 | 1 | 55 |
| 9 | 3/1/2020 | 2 | 50 | 2 | 1 | 53 |
+--------+----------+-------+---------+--------+----------+-----------------+
My desired output would be as follows, where for each unique P_buy/P_sell pair I receive the sum of the Total column:
+--------+--------+-------+
| P_buy  | P_sell | Total |
+--------+--------+-------+
| 1      | 1      | 100   |
| 2      | 1      | 103   |
| 3      | 1      | 70    |
+--------+--------+-------+
My attempts have used the groupby function, but I haven't been able to implement it successfully.
# use a groupby on the desired columns and sum the total
df.groupby(['P_buy','P_sell'], as_index=False)['Total'].sum()
P_buy P_sell Total
0 1 1 100
1 2 1 103
2 3 1 70
3 4 1 60
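If the totals are also needed per date (the question mentions "for each date", although the desired output collapses the dates), a possible variant is to add Date to the grouping keys:
# per-date totals for each P_buy/P_sell pair
df.groupby(['Date', 'P_buy', 'P_sell'], as_index=False)['Total'].sum()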

shift below cells to count for R

I am using the code below to produce the following result in Python, and I want an equivalent of this code in R.
Here N is a column of the dataframe data. The CN column is calculated from the values of column N with a specific pattern, and it gives me the following result in Python.
+---+----+
| N | CN |
+---+----+
| 0 | 0 |
| 1 | 1 |
| 1 | 1 |
| 2 | 2 |
| 2 | 2 |
| 0 | 3 |
| 0 | 3 |
| 1 | 4 |
| 1 | 4 |
| 1 | 4 |
| 2 | 5 |
| 2 | 5 |
| 3 | 6 |
| 4 | 7 |
| 0 | 8 |
| 1 | 9 |
| 2 | 10 |
+---+----+
A short overview of my code is:
data = pd.read_table(filename,skiprows=15,decimal=',', sep='\t',header=None,names=["Date ","Heure ","temps (s) ","X","Z"," LVDT V(mm) " ,"Force normale (N) ","FT","FN(N) ","TS"," NS(kPa) ","V (mm/min)","Vitesse normale (mm/min)","e (kPa)","k (kPa/mm) " ,"N " ,"Nb cycles normal" ,"Cycles " ,"Etat normal" ,"k imposÈ (kPa/mm)"])
data.columns = [col.strip() for col in data.columns.tolist()]
N = data[data.keys()[15]]
N = np.array(N)
data["CN"] = (data.N.shift().bfill() != data.N).astype(int).cumsum()
An example of data.head() is shown here:
+-------+-------------+------------+-----------+----------+----------+------------+-------------------+-----------+-------------+-----------+------------+------------+--------------------------+------------+------------+-----+------------------+--------+-------------+-------------------+----+
| Index | Date | Heure | temps (s) | X | Z(mm) | LVDT V(mm) | Force normale (N) | FT | FN(N) | FT (kPa) | NS(kPa) | V (mm/min) | Vitesse normale (mm/min) | e (kPa) | k (kPa/mm) | N | Nb cycles normal | Cycles | Etat normal | k imposÈ (kPa/mm) | CN |
+-------+-------------+------------+-----------+----------+----------+------------+-------------------+-----------+-------------+-----------+------------+------------+--------------------------+------------+------------+-----+------------------+--------+-------------+-------------------+----+
| 184 | 01/02/2022 | 12:36:52 | 402.163 | 6.910243 | 1.204797 | 0.001101 | 299.783665 | 31.494351 | 1428.988908 | 11.188704 | 505.825016 | 0.1 | 2.0 | 512.438828 | 50.918786 | 0.0 | 0.0 | Sort | Monte | 0.0 | 0 |
| 185 | 01/02/2022 | 12:36:54 | 404.288 | 6.907822 | 1.205647 | 4.9e-05 | 296.072718 | 31.162313 | 1404.195316 | 11.028167 | 494.97955 | 0.1 | -2.0 | 500.084986 | 49.685639 | 0.0 | 0.0 | Sort | Descend | 0.0 | 0 |
| 186 | 01/02/2022 | 12:36:56 | 406.536 | 6.907906 | 1.204194 | -0.000214 | 300.231424 | 31.586401 | 1429.123486 | 11.21895 | 505.750815 | 0.1 | 2.0 | 512.370164 | 50.914002 | 0.0 | 0.0 | Sort | Monte | 0.0 | 0 |
| 187 | 01/02/2022 | 12:36:58 | 408.627 | 6.910751 | 1.204293 | -0.000608 | 300.188686 | 31.754064 | 1428.979519 | 11.244542 | 505.624564 | 0.1 | 2.0 | 512.309254 | 50.906544 | 0.0 | 0.0 | Sort | Monte | 0.0 | 0 |
| 188 | 01/02/2022 | 12:37:00 | 410.679 | 6.907805 | 1.205854 | -0.000181 | 296.358074 | 31.563389 | 1415.224427 | 11.129375 | 502.464948 | 0.1 | 2.0 | 510.702313 | 50.742104 | 0.0 | 0.0 | Sort | Monte | 0.0 | 0 |
+-------+-------------+------------+-----------+----------+----------+------------+-------------------+-----------+-------------+-----------+------------+------------+--------------------------+------------+------------+-----+------------------+--------+-------------+-------------------+----+
A one-line cumsum trick solves it.
cumsum(c(0L, diff(df1$N) != 0))
#> [1] 0 1 1 2 2 3 3 4 4 4 5 5 6 7 8 9 10
all.equal(
cumsum(c(0L, diff(df1$N) != 0)),
df1$CN
)
#> [1] TRUE
Created on 2022-02-14 by the reprex package (v2.0.1)
Data
x <- "
+---+----+
| N | CN |
+---+----+
| 0 | 0 |
| 1 | 1 |
| 1 | 1 |
| 2 | 2 |
| 2 | 2 |
| 0 | 3 |
| 0 | 3 |
| 1 | 4 |
| 1 | 4 |
| 1 | 4 |
| 2 | 5 |
| 2 | 5 |
| 3 | 6 |
| 4 | 7 |
| 0 | 8 |
| 1 | 9 |
| 2 | 10 |
+---+----+"
df1 <- read.table(textConnection(x), header = TRUE, sep = "|", comment.char = "+")[2:3]
Created on 2022-02-14 by the reprex package (v2.0.1)

Hot Deck Imputation in Python

I have been trying to find Python code that would allow me to replace missing values in a dataframe's column. The focus of my analysis is in biostatistics so I am not comfortable with replacing values using means/medians/modes. I would like to apply the "Hot Deck Imputation" method.
I cannot find any Python functions or packages online that takes the column of a dataframe and fills missing values with the "Hot Deck Imputation" method.
I did, however, see this GitHub project and did not find it useful.
The following is an example of some of my data (assume this is a pandas dataframe):
| age | sex | bmi | anesthesia score | pain level |
|-----|-----|------|------------------|------------|
| 78 | 1 | 40.7 | 3 | 0 |
| 55 | 1 | 25.3 | 3 | 0 |
| 52 | 0 | 25.4 | 3 | 0 |
| 77 | 1 | 44.9 | 3 | 3 |
| 71 | 1 | 26.3 | 3 | 0 |
| 39 | 0 | 28.2 | 2 | 0 |
| 82 | 1 | 27 | 2 | 1 |
| 70 | 1 | 37.9 | 3 | 0 |
| 71 | 1 | NA | 3 | 1 |
| 53 | 0 | 24.5 | 2 | NA |
| 68 | 0 | 34.7 | 3 | 0 |
| 57 | 0 | 30.7 | 2 | 0 |
| 40 | 1 | 22.4 | 2 | 0 |
| 73 | 1 | 34.2 | 2 | 0 |
| 66 | 1 | NA | 3 | 1 |
| 55 | 1 | 42.6 | NA | NA |
| 53 | 0 | 37.5 | 3 | 3 |
| 65 | 0 | 31.6 | 2 | 2 |
| 36 | 0 | 29.6 | 1 | 0 |
| 60 | 0 | 25.7 | 2 | NA |
| 70 | 1 | 30 | NA | NA |
| 66 | 1 | 28.3 | 2 | 0 |
| 63 | 1 | 29.4 | 3 | 2 |
| 70 | 1 | 36 | 3 | 2 |
I would like to apply a Python function that would allow me to input a column as a parameter and return the column with the missing values replaced with imputed values using the "Hot Deck Imputation" method.
I am using this for the purpose of statistical modeling with models such as linear and logistic regression using Statsmodels.api. I am not using this for Machine Learning.
Any help would be much appreciated!
You can use ffill, which applies last observation carried forward (LOCF) hot deck imputation.
# forward-fill: propagate the last observed value into the missing entries
df.fillna(method='ffill', inplace=True)
# in newer pandas versions, where the method= argument is deprecated, use:
# df.ffill(inplace=True)
Scikit-learn's impute module offers KNN, mean, median, most-frequent and other imputing methods. (https://scikit-learn.org/stable/modules/impute.html)
# sklearn '>=0.22.x'
from sklearn.impute import KNNImputer
imputer = KNNImputer(n_neighbors=2, weights="uniform")
DF['imputed_x'] = imputer.fit_transform(DF[['bmi']])
print(DF['imputed_x'])
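If you want an actual hot deck imputation rather than LOCF or KNN, here is a minimal sketch of a simple random hot deck: each missing value is replaced by a value drawn at random from the observed (non-missing) donors in the same column. The function name hot_deck_impute and the fixed random seed are illustrative choices, not an existing library API.
import numpy as np
import pandas as pd

def hot_deck_impute(series, random_state=None):
    """Replace NaNs in a Series with random draws from its observed values."""
    rng = np.random.default_rng(random_state)
    out = series.copy()
    donors = series.dropna().to_numpy()
    missing = out.isna()
    # sample one donor value (with replacement) for every missing entry
    out.loc[missing] = rng.choice(donors, size=missing.sum())
    return out

# usage, assuming df is the dataframe shown above
df["bmi"] = hot_deck_impute(df["bmi"], random_state=0)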

Reattach ID column to data after passing through sklearn model

I am working on trying to create a predictive regression model that forecasts the completion date of a number of orders.
My dataset looks like:
| ORDER_NUMBER | Feature1 | Feature2 | Feature3 | Feature4 | Feature5 | Feature6 | TOTAL_DAYS_TO_COMPLETE | Feature8 | Feature9 | Feature10 | Feature11 | Feature12 | Feature13 | Feature14 | Feature15 | Feature16 | Feature17 | Feature18 | Feature19 | Feature20 | Feature21 | Feature22 | Feature23 | Feature24 | Feature25 | Feature26 | Feature27 | Feature28 | Feature29 | Feature30 | Feature31 |
|:------------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:----------------------:|:--------:|:--------:|:---------:|:---------:|:---------:|:---------:|:---------:|:---------:|:---------:|:---------:|:---------:|:---------:|:---------:|:---------:|:---------:|:---------:|:---------:|:---------:|:---------:|:---------:|:---------:|:---------:|:---------:|:---------:|
| 102203591 | 12 | 2014 | 10 | 2014 | 1 | 2015 | 760 | 50 | 83 | 5 | 6 | 12 | 18 | 31 | 8 | 0 | 1 | 0 | 1 | 16 | 131.29 | 24.3768 | 158.82 | 1.13 | 6.52 | 10 | 51 | 39 | 27 | 88 | 1084938 |
| 102231010 | 2 | 2015 | 1 | 2015 | 2 | 2015 | 706 | 35 | 34 | 2 | 1 | 4 | 3 | 3 | 3 | 0 | 0 | 0 | 1 | 2 | 11.95 | 5.162 | 17.83 | 1.14 | 3.45 | 1 | 4 | 20 | 16 | 25 | 367140 |
| 102251893 | 6 | 2015 | 4 | 2015 | 3 | 2015 | 1143 | 36 | 43 | 1 | 2 | 4 | 5 | 6 | 3 | 1 | 0 | 0 | 1 | 5 | 8.55 | 5.653 | 34.51 | 4.59 | 6.1 | 0 | 1 | 17 | 30 | 12 | 103906 |
| 102287793 | 4 | 2015 | 2 | 2015 | 4 | 2015 | 733 | 45 | 71 | 4 | 1 | 6 | 35 | 727 | 6 | 0 | 3 | 15 | 0 | 19 | 174.69 | 97.448 | 319.98 | 1.49 | 3.28 | 20 | 113 | 71 | 59 | 71 | 1005041 |
| 102288060 | 6 | 2015 | 5 | 2015 | 4 | 2015 | 1092 | 26 | 21 | 1 | 1 | 3 | 2 | 2 | 1 | 0 | 0 | 0 | 0 | 2 | 4.73 | 4.5363 | 18.85 | 3.11 | 4.16 | 0 | 1 | 16 | 8 | 16 | 69062 |
| 102308069 | 8 | 2015 | 6 | 2015 | 5 | 2015 | 676 | 41 | 34 | 2 | 0 | 3 | 2 | 2 | 1 | 0 | 0 | 0 | 0 | 2 | 2.98 | 6.1173 | 11.3 | 1.36 | 1.85 | 0 | 1 | 17 | 12 | 3 | 145887 |
| 102319918 | 8 | 2015 | 7 | 2015 | 6 | 2015 | 884 | 25 | 37 | 1 | 1 | 3 | 2 | 3 | 2 | 0 | 0 | 1 | 0 | 2 | 5.57 | 3.7083 | 9.18 | 0.97 | 2.48 | 0 | 1 | 14 | 5 | 7 | 45243 |
| 102327578 | 6 | 2015 | 4 | 2015 | 6 | 2015 | 595 | 49 | 68 | 3 | 5 | 9 | 11 | 13 | 5 | 4 | 2 | 0 | 1 | 10 | 55.41 | 24.3768 | 104.98 | 2.03 | 4.31 | 10 | 51 | 39 | 26 | 40 | 418266 |
| 102337989 | 7 | 2015 | 5 | 2015 | 7 | 2015 | 799 | 50 | 66 | 5 | 6 | 12 | 21 | 29 | 12 | 0 | 0 | 0 | 1 | 20 | 138.79 | 24.3768 | 172.56 | 1.39 | 7.08 | 10 | 51 | 39 | 34 | 101 | 1229299 |
| 102450069 | 8 | 2015 | 7 | 2015 | 11 | 2015 | 456 | 20 | 120 | 2 | 1 | 3 | 12 | 14 | 8 | 0 | 0 | 0 | 0 | 7 | 2.92 | 6.561 | 12.3 | 1.43 | 1.87 | 2 | 1 | 15 | 6 | 6 | 142805 |
| 102514564 | 5 | 2016 | 3 | 2016 | 2 | 2016 | 639 | 25 | 35 | 1 | 2 | 4 | 3 | 6 | 3 | 0 | 0 | 0 | 0 | 3 | 4.83 | 4.648 | 14.22 | 2.02 | 3.06 | 0 | 1 | 15 | 5 | 13 | 62941 |
| 102528121 | 10 | 2015 | 9 | 2015 | 3 | 2016 | 413 | 15 | 166 | 1 | 1 | 3 | 2 | 3 | 2 | 0 | 0 | 0 | 0 | 2 | 4.23 | 1.333 | 15.78 | 8.66 | 11.84 | 1 | 4 | 8 | 6 | 3 | 111752 |
| 102564376 | 1 | 2016 | 12 | 2015 | 4 | 2016 | 802 | 27 | 123 | 2 | 1 | 4 | 3 | 3 | 3 | 0 | 1 | 0 | 0 | 3 | 1.27 | 2.063 | 6.9 | 2.73 | 3.34 | 1 | 4 | 14 | 20 | 6 | 132403 |
| 102564472 | 1 | 2016 | 12 | 2015 | 4 | 2016 | 817 | 27 | 123 | 0 | 1 | 2 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1.03 | 2.063 | 9.86 | 4.28 | 4.78 | 1 | 4 | 14 | 22 | 4 | 116907 |
| 102599569 | 2 | 2016 | 12 | 2015 | 5 | 2016 | 425 | 47 | 151 | 1 | 2 | 4 | 3 | 4 | 3 | 0 | 0 | 0 | 0 | 2 | 27.73 | 15.8993 | 60.5 | 2.06 | 3.81 | 12 | 108 | 34 | 24 | 20 | 119743 |
| 102599628 | 2 | 2016 | 12 | 2015 | 5 | 2016 | 425 | 47 | 151 | 3 | 4 | 8 | 8 | 9 | 7 | 0 | 0 | 0 | 2 | 8 | 39.28 | 14.8593 | 91.26 | 3.5 | 6.14 | 12 | 108 | 34 | 38 | 15 | 173001 |
| 102606421 | 3 | 2016 | 12 | 2015 | 5 | 2016 | 965 | 55 | 161 | 5 | 11 | 17 | 29 | 44 | 11 | 1 | 1 | 0 | 1 | 22 | 148.06 | 23.7983 | 195.69 | 2 | 8.22 | 10 | 51 | 39 | 47 | 112 | 1196097 |
| 102621293 | 7 | 2016 | 5 | 2016 | 6 | 2016 | 701 | 42 | 27 | 2 | 1 | 4 | 3 | 3 | 1 | 0 | 0 | 0 | 1 | 2 | 8.39 | 3.7455 | 13.93 | 1.48 | 3.72 | 1 | 5 | 14 | 14 | 20 | 258629 |
| 102632364 | 7 | 2016 | 6 | 2016 | 6 | 2016 | 982 | 41 | 26 | 4 | 2 | 7 | 6 | 6 | 2 | 0 | 0 | 0 | 1 | 4 | 26.07 | 2.818 | 37.12 | 3.92 | 13.17 | 1 | 5 | 14 | 22 | 10 | 167768 |
| 102643207 | 9 | 2016 | 9 | 2016 | 7 | 2016 | 255 | 9 | 73 | 3 | 1 | 5 | 4 | 4 | 2 | 0 | 0 | 0 | 0 | 0 | 2.17 | 0.188 | 4.98 | 14.95 | 26.49 | 1 | 4 | 2 | 11 | 1 | 49070 |
| 102656091 | 9 | 2016 | 8 | 2016 | 7 | 2016 | 356 | 21 | 35 | 1 | 0 | 2 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1.45 | 2.0398 | 5.54 | 2.01 | 2.72 | 1 | 4 | 14 | 15 | 3 | 117107 |
| 102660407 | 9 | 2016 | 8 | 2016 | 7 | 2016 | 462 | 21 | 31 | 2 | 0 | 3 | 2 | 2 | 1 | 0 | 0 | 0 | 0 | 2 | 3.18 | 2.063 | 8.76 | 2.7 | 4.25 | 1 | 4 | 14 | 14 | 10 | 151272 |
| 102665666 | 10 | 2016 | 9 | 2016 | 7 | 2016 | 235 | 9 | 64 | 0 | 1 | 2 | 1 | 2 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0.188 | 2.95 | 10.37 | 15.69 | 1 | 4 | 2 | 10 | 1 | 52578 |
| 102665667 | 10 | 2016 | 9 | 2016 | 7 | 2016 | 235 | 9 | 64 | 0 | 1 | 2 | 1 | 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0.72 | 0.188 | 2.22 | 7.98 | 11.81 | 1 | 4 | 2 | 10 | 1 | 52578 |
| 102665668 | 10 | 2016 | 9 | 2016 | 7 | 2016 | 235 | 9 | 64 | 0 | 1 | 2 | 1 | 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0.9 | 0.188 | 2.24 | 7.13 | 11.91 | 1 | 4 | 2 | 10 | 1 | 52578 |
| 102666306 | 7 | 2016 | 6 | 2016 | 7 | 2016 | 235 | 16 | 34 | 3 | 1 | 5 | 5 | 6 | 4 | 0 | 0 | 0 | 0 | 3 | 14.06 | 3.3235 | 31.27 | 5.18 | 9.41 | 1 | 1 | 16 | 5 | 18 | 246030 |
| 102668177 | 8 | 2016 | 6 | 2016 | 8 | 2016 | 233 | 36 | 32 | 0 | 1 | 2 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 2.5 | 5.2043 | 8.46 | 1.15 | 1.63 | 0 | 1 | 14 | 2 | 4 | 89059 |
| 102669909 | 6 | 2016 | 4 | 2016 | 8 | 2016 | 244 | 46 | 105 | 4 | 11 | 16 | 28 | 30 | 15 | 1 | 2 | 1 | 1 | 25 | 95.49 | 26.541 | 146.89 | 1.94 | 5.53 | 1 | 51 | 33 | 9 | 48 | 78488 |
| 102670188 | 5 | 2016 | 4 | 2016 | 8 | 2016 | 413 | 20 | 109 | 1 | 1 | 2 | 2 | 3 | 2 | 0 | 0 | 0 | 0 | 1 | 2.36 | 6.338 | 8.25 | 0.93 | 1.3 | 2 | 1 | 14 | 5 | 3 | 117137 |
| 102671063 | 8 | 2016 | 6 | 2016 | 8 | 2016 | 296 | 46 | 44 | 2 | 4 | 7 | 7 | 111 | 3 | 1 | 0 | 1 | 0 | 7 | 12.96 | 98.748 | 146.24 | 1.35 | 1.48 | 20 | 113 | 70 | 26 | 9 | 430192 |
| 102672475 | 8 | 2016 | 7 | 2016 | 8 | 2016 | 217 | 20 | 23 | 0 | 1 | 2 | 1 | 2 | 1 | 0 | 0 | 0 | 0 | 1 | 0.5 | 4.9093 | 5.37 | 0.99 | 1.09 | 0 | 1 | 16 | 0 | 1 | 116673 |
| 102672477 | 10 | 2016 | 9 | 2016 | 8 | 2016 | 194 | 20 | 36 | 1 | 0 | 2 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0.61 | 5.1425 | 3.65 | 0.59 | 0.71 | 0 | 1 | 16 | 0 | 2 | 98750 |
| 102672513 | 10 | 2016 | 9 | 2016 | 8 | 2016 | 228 | 20 | 36 | 1 | 1 | 3 | 2 | 2 | 1 | 0 | 0 | 0 | 0 | 1 | 0.25 | 5.1425 | 6.48 | 1.21 | 1.26 | 0 | 1 | 16 | 0 | 2 | 116780 |
| 102682943 | 5 | 2016 | 4 | 2016 | 8 | 2016 | 417 | 20 | 113 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0.64 | 6.338 | 5.53 | 0.77 | 0.87 | 2 | 1 | 14 | 5 | 2 | 100307 |
The ORDER_NUMBER should not be a feature in the model -- it is a unique identifier that I would like to essentially not count in the model since it is just a random ID, but include in the final dataset, so I can tie back predictions and actual values to the order.
Currently, my code looks like this:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
import pandas as pd
import numpy as np
def get_feature_importances(cols, importances):
feats = {}
for feature, importance in zip(cols, importances):
feats[feature] = importance
importances = pd.DataFrame.from_dict(feats, orient='index').rename(columns={0: 'Gini-importance'})
return importances.sort_values(by='Gini-importance', ascending = False)
def compare_values(arr1, arr2):
thediff = 0
thediffs = []
for thing1, thing2 in zip(arr1, arr2):
thediff = abs(thing1 - thing2)
thediffs.append(thediff)
return thediffs
def print_to_file(filepath, arr):
with open(filepath, 'w') as f:
for item in arr:
f.write("%s\n" % item)
# READ IN THE DATA TABLE ABOVE
data = pd.read_csv('test.csv')
# create the labels, or field we are trying to estimate
label = data['TOTAL_DAYS_TO_COMPLETE']
# remove the header
label = label[1:]
# create the data, or the data that is to be estimated
data = data.drop('TOTAL_DAYS_TO_COMPLETE', axis=1)
# Remove the order number since we don't need it
data = data.drop('ORDER_NUMBER', axis=1)
# remove the header
data = data[1:]
# # split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(data, label, test_size = 0.2)
rf = RandomForestRegressor(
bootstrap = True,
max_depth = None,
max_features = 'sqrt',
min_samples_leaf = 1,
min_samples_split = 2,
n_estimators = 5000
)
rf.fit(X_train, y_train)
rf_predictions = rf.predict(X_test)
rf_differences = compare_values(y_test, rf_predictions)
rf_Avg = np.average(rf_differences)
print("#################################################")
print("DATA FOR RANDOM FORESTS")
print(rf_Avg)
importances = get_feature_importances(X_test.columns, rf.feature_importances_)
print()
print(importances)
If I print(y_test) and print(rf_predictions) I get something like:
**print(y_test)**
7
155
84
64
49
41
200
168
43
111
64
46
96
47
50
27
216
..
**print(rf_predictions)**
34.496
77.366
69.6105
61.6825
80.8495
79.8785
177.5465
129.014
70.0405
97.3975
82.4435
57.9575
108.018
57.5515
..
And it works. If I print out y_test and rf_predictions, I get the labels for the test data and the predicted label values.
However, I would like to see which orders are associated with both the y_test values and the rf_predictions values. How can I keep that information and create a dataframe like the one below:
| Order Number | Predicted Value | Actual Value |
|--------------|-------------------|--------------|
| Foo0 | 34.496 | 7 |
| Foo1 | 77.366 | 155 |
| Foo2 | 69.6105 | 84 |
| Foo3 | 61.6825 | 64 |
I have tried looking at this post but could not get a solution from it. I did try print(y_test, rf_predictions), but that did not help since I had already dropped the ORDER_NUMBER field with .drop().
As you're using pandas dataframes, the index is retained in all your X/y train/test datasets, so you can re-assemble everything after applying the model. We just need to save the order numbers before dropping that column: order_numbers = data['ORDER_NUMBER']. The predictions rf_predictions are returned in the same order as the input to rf.predict(X_test), i.e. rf_predictions[i] belongs to X_test.iloc[i].
This creates your required result dataset:
res = y_test.to_frame('Actual Value')
res.insert(0, 'Predicted Value', rf_predictions)
res = order_numbers.to_frame().join(res, how='inner')
Btw, data = data[1:] doesn't remove the header, it removes the first row, so there's no need to remove anything when you work with pandas dataframes.
So the final program will be:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
import pandas as pd
import numpy as np
def get_feature_importances(cols, importances):
feats = {}
for feature, importance in zip(cols, importances):
feats[feature] = importance
importances = pd.DataFrame.from_dict(feats, orient='index').rename(columns={0: 'Gini-importance'})
return importances.sort_values(by='Gini-importance', ascending = False)
def compare_values(arr1, arr2):
thediff = 0
thediffs = []
for thing1, thing2 in zip(arr1, arr2):
thediff = abs(thing1 - thing2)
thediffs.append(thediff)
return thediffs
def print_to_file(filepath, arr):
with open(filepath, 'w') as f:
for item in arr:
f.write("%s\n" % item)
# READ IN THE DATA TABLE ABOVE
data = pd.read_csv('test.csv')
# create the labels, or field we are trying to estimate
label = data['TOTAL_DAYS_TO_COMPLETE']
# create the data, or the data that is to be estimated
data = data.drop('TOTAL_DAYS_TO_COMPLETE', axis=1)
# save the order numbers, then remove the column from the features
order_numbers = data['ORDER_NUMBER']
data = data.drop('ORDER_NUMBER', axis=1)
# split into training and testing sets (random_state=1 reproduces the output below)
X_train, X_test, y_train, y_test = train_test_split(data, label, test_size = 0.2, random_state = 1)
rf = RandomForestRegressor(
bootstrap = True,
max_depth = None,
max_features = 'sqrt',
min_samples_leaf = 1,
min_samples_split = 2,
n_estimators = 5000
)
rf.fit(X_train, y_train)
rf_predictions = rf.predict(X_test)
rf_differences = compare_values(y_test, rf_predictions)
rf_Avg = np.average(rf_differences)
print("#################################################")
print("DATA FOR RANDOM FORESTS")
print(rf_Avg)
importances = get_feature_importances(X_test.columns, rf.feature_importances_)
print()
print(importances)
res = y_test.to_frame('Actual Value')
res.insert(0, 'Predicted Value', rf_predictions)
res = order_numbers.to_frame().join(res, how='inner')
print(res)
With your example data from above we get (for train_test_split with random_state=1):
ORDER_NUMBER Predicted Value Actual Value
3 102287793 652.0746 733
14 102599569 650.3984 425
19 102643207 319.4964 255
20 102656091 388.6004 356
26 102668177 475.1724 233
27 102669909 671.9158 244
32 102672513 319.1550 228
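An equivalent way to attach the order numbers (a sketch, not part of the original answer) is to index the saved order_numbers Series with X_test.index, since train_test_split keeps the original dataframe index on every split:
res = pd.DataFrame({
    'ORDER_NUMBER': order_numbers.loc[X_test.index],
    'Predicted Value': rf_predictions,
    'Actual Value': y_test,
})
print(res)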
