I've read several threads about plotting grouped Seaborn boxplots, but I was wondering whether there is a simpler solution, ideally a one-liner.
The Pandas DataFrame contains something along the lines of:
Index xaxis yaxis1 yaxis2
0 A 30 1985
1 A 29 2002
2 B 21 3034
3 A 31 2087
4 B 19 2931
5 B 21 2832
6 A 28 1950
sns.boxplot(x='xaxis', y=['yaxis1','yaxis2'], data=df);
doesn't work (probably for obvious reasons), while
sns.boxplot(x='xaxis', y='yaxis1', data=df);
or
sns.boxplot(x='xaxis', y='yaxis2', data=df);
work just fine for the separate plots. I also tried using
sns.boxplot(df['xaxis'], df[['yaxis1','yaxis2']])
but had no luck with that either.
I want both yaxis columns combined into a single boxplot, similar to this one https://seaborn.pydata.org/examples/grouped_boxplot.html, but I can't use hue=, as the data for both y axes is continuous.
Is there any way I can do that with a one-line sprint, or do I inevitably have to run the whole marathon?
If you want to create grouped boxplots with seaborn, you have to use hue=. The trick is to create a long-form dataframe where all your yaxis{1,2} values are in one column, and another column indicates which of the two original columns each row came from.
This is accomplished using DataFrame.melt():
df
| Index | xaxis | yaxis1 | yaxis2 |
|--------:|:--------|---------:|---------:|
| 0 | A | 30 | 1985 |
| 1 | A | 29 | 2002 |
| 2 | B | 21 | 3034 |
| 3 | A | 31 | 2087 |
| 4 | B | 19 | 2931 |
| 5 | B | 21 | 2832 |
| 6 | A | 28 | 1950 |
df2 = df.melt(id_vars=['xaxis'], var_name='yaxis')
| | xaxis | yaxis | value |
|---:|:--------|:--------|--------:|
| 0 | A | yaxis1 | 30 |
| 1 | A | yaxis1 | 29 |
| 2 | B | yaxis1 | 21 |
| 3 | A | yaxis1 | 31 |
| 4 | B | yaxis1 | 19 |
| 5 | B | yaxis1 | 21 |
| 6 | A | yaxis1 | 28 |
| 7 | A | yaxis2 | 1985 |
| 8 | A | yaxis2 | 2002 |
| 9 | B | yaxis2 | 3034 |
| 10 | A | yaxis2 | 2087 |
| 11 | B | yaxis2 | 2931 |
| 12 | B | yaxis2 | 2832 |
| 13 | A | yaxis2 | 1950 |
sns.boxplot(x='xaxis', y='value', hue='yaxis', data=df2)
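If you want the one-liner you asked for, the melt can also be done inline (same operation as above, just chained):
sns.boxplot(x='xaxis', y='value', hue='yaxis', data=df.melt(id_vars=['xaxis'], var_name='yaxis'))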
I'd like to create a table from a dataframe with subtotals per business, totals per business type, and columns summing multiple value columns. The long-term goal is to build a selection tool on top of the ingested Excel sheet so that, for whichever month's summary I bring in, I can compare month-to-month summaries (e.g. did minerals item 26 from BA3 disappear the next month?), but I believe that is best saved for another question.
For now, I am having trouble figuring out how to summarize the data.
I have a dataframe in Pandas that contains the following:
Business | Business Type | ID | Value-Q1 | Value-Q2 | Value-Q3 | Value-Q4 | Value-FY |
---------+---------------+----+----------+----------+----------+----------+----------+
BA1 | Widgets | 1 | 7 | 0 | 0 | 8 | 15 |
BA1 | Widgets | 2 | 7 | 0 | 0 | 8 | 15 |
BA1 | Cups | 3 | 9 | 10 | 0 | 0 | 19 |
BA1 | Cups | 4 | 9 | 10 | 0 | 0 | 19 |
BA1 | Cups | 5 | 9 | 10 | 0 | 0 | 19 |
BA1 | Snorkels | 6 | 0 | 0 | 8 | 8 | 16 |
BA1 | Snorkels | 7 | 0 | 0 | 8 | 8 | 16 |
BA1 | Snorkels | 8 | 0 | 0 | 8 | 8 | 16 |
BA2 | Widgets | 9 | 100 | 0 | 7 | 0 | 107 |
BA2 | Widgets | 10 | 100 | 0 | 7 | 0 | 107 |
BA2 | Widgets | 11 | 100 | 0 | 7 | 0 | 107 |
BA2 | Widgets | 12 | 100 | 0 | 7 | 0 | 107 |
BA2 | Bread | 13 | 0 | 0 | 0 | 1 | 1 |
BA2 | Bread | 14 | 0 | 0 | 0 | 1 | 1 |
BA2 | Bread | 15 | 0 | 0 | 0 | 1 | 1 |
BA2 | Bread | 16 | 0 | 0 | 0 | 1 | 1 |
BA2 | Cat Food | 17 | 504 | 0 | 0 | 500 | 1004 |
BA2 | Cat Food | 18 | 504 | 0 | 0 | 500 | 1004 |
BA2 | Cat Food | 19 | 504 | 0 | 0 | 500 | 1004 |
BA2 | Cat Food | 20 | 504 | 0 | 0 | 500 | 1004 |
BA2 | Cat Food | 21 | 504 | 0 | 0 | 500 | 1004 |
BA3 | Gravel | 22 | 7 | 7 | 7 | 7 | 28 |
BA3 | Gravel | 23 | 7 | 7 | 7 | 7 | 28 |
BA3 | Gravel | 24 | 7 | 7 | 7 | 7 | 28 |
BA3 | Rocks | 25 | 3 | 2 | 0 | 0 | 5 |
BA3 | Minerals | 26 | 1 | 1 | 0 | 1 | 3 |
BA3 | Minerals | 27 | 1 | 1 | 0 | 1 | 3 |
BA4 | Widgets | 28 | 6 | 4 | 0 | 0 | 10 |
BA4 | Widgets | 29 | 6 | 4 | 0 | 0 | 10 |
BA4 | Widgets | 30 | 6 | 4 | 0 | 0 | 10 |
BA4 | Widgets | 31 | 6 | 4 | 0 | 0 | 10 |
BA4 | Widgets | 32 | 6 | 4 | 0 | 0 | 10 |
BA4 | Something | 33 | 1000 | 0 | 0 | 2 | 1002 |
BA5 | Bonbons | 34 | 60 | 40 | 10 | 0 | 110 |
BA5 | Bonbons | 35 | 60 | 40 | 10 | 0 | 110 |
BA5 | Gummy Bears | 36 | 7 | 0 | 0 | 9 | 16 |
(Imagine each ID has different values as well)
My goal is to slice the data to get the total occurrences of a given business type (e.g. BA1 has 2 widgets, 3 cups, and 3 snorkels, each with a unique ID) as well as the total values:
Occurrence | Q1 Sum | Q2 Sum | Q3 Sum | Q4 Sum | FY Sum |
BA 1 8 | 41 | 30 | 24 | 40 | 135 |
Widgets 2 | 14 | 0 | 0 | 16 | 30 |
Cups 3 | 27 | 30 | 0 | 0 | 57 |
Snorkels 3 | 0 | 0 | 24 | 24 | 48 |
BA 2 Subtotal of BA2 items below
Widgets Repeat Above
Bread Repeat Above
Cat Food Repeat Above
I have more columns per line that mirror the Q1-FY columns with other fields (e.g. Value 2 Q1-FY) that I would also like to include in the summary, but I imagine I could just repeat whatever process is used to grab the current Value columns.
I have a list of unique Businesses
businesses = ['BA1', 'BA2', 'BA3', 'BA4', 'BA5']
and a list of unique Business Types
business_types = ['Widgets', 'Cups', 'Snorkels', 'Bread', 'Cat Food', 'Gravel', 'Rocks', 'Minerals', 'Something', 'Bonbons', 'Gummy Bears']
and finally a list of the Values
values = ['Value-Q1', 'Value-Q2', 'Value-Q3', 'Value-Q4', 'Value-FY']
and I tried doing a for loop over the lists.
Maybe I need to put the dataframe values on their own individual lines? I tried the following, at least for the FY sum:
for b in businesses:
    for bt in business_types:
        df_sums = df.loc['Business' == b, 'Business Type' == bt, 'Value-FY'].sum()
but it didn't quite give me what I was hoping for.
I'm sure there's a better way to grab the values; I managed to get FY totals per business into a dictionary, but not totals per business per business type (which is also unique per business).
If anyone has any advice or can point me in the right direction I'd really appreciate it!
You should try the groupby method for this. groupby allows for several grouping options. I have attached a link to the documentation for the method: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html
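For example, a rough sketch along these lines (column names taken from your sample data, with df assumed to be the dataframe shown above) gives the per-business and per-business-type totals plus an occurrence count:
value_cols = ['Value-Q1', 'Value-Q2', 'Value-Q3', 'Value-Q4', 'Value-FY']
# totals and occurrence counts per business and business type
by_type = df.groupby(['Business', 'Business Type'])[value_cols].sum()
by_type['Occurrence'] = df.groupby(['Business', 'Business Type'])['ID'].count()
# subtotals per business
by_business = df.groupby('Business')[value_cols].sum()
by_business['Occurrence'] = df.groupby('Business')['ID'].count()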
So I have the following Dask dataframe, grouped by the Problem column.
| Problem | Items | Min_Dimension | Max_Dimension | Cost |
|-------- |------ |---------------|-------------- |------ |
| A | 7 | 2 | 15 | 23 |
| A | 5 | 2 | 15 | 38 |
| A | 15 | 2 | 15 | 23 |
| B | 11 | 6 | 10 | 54 |
| B | 10 | 6 | 10 | 48 |
| B | 18 | 6 | 10 | 79 |
| C | 50 | 8 | 25 | 120 |
| C | 50 | 8 | 25 | 68 |
| C | 48 | 8 | 25 | 68 |
| ... | ... | ... | ... | ... |
The goal is to create a new dataframe with all rows where the Cost value is minimal for that particular Problem group. So we want the following result:
| Problem | Items | Min_Dimension | Max_Dimension | Cost |
|-------- |------ |---------------|-------------- |------ |
| A | 7 | 2 | 15 | 23 |
| A | 15 | 2 | 15 | 23 |
| B | 10 | 6 | 10 | 48 |
| C | 50 | 8 | 25 | 68 |
| C | 48 | 8 | 25 | 68 |
| ... | ... | ... | ... | ... |
How can I achieve this result? I already tried using idxmin() as mentioned in another question on here, but then I get a ValueError: Not all divisions are known, can't align partitions. Please use set_index to set the index.
What if you create another dataframe that is grouped by Problem with Cost.min()? Let's say the new column is called cost_min.
df1 = df.groupby('Problem')['Cost'].min().reset_index().rename(columns={'Cost': 'cost_min'})
Then, merge this new cost_min column back into the dataframe.
df2 = df.merge(df1, how='left', on='Problem')
From there, do something like:
df_new = df2.loc[df2['Cost'] == df2['cost_min']]
Just wrote some pseudocode, but I think that all works with Dask.
I am trying to create a predictive regression model that forecasts the completion date of a number of orders.
My dataset looks like:
| ORDER_NUMBER | Feature1 | Feature2 | Feature3 | Feature4 | Feature5 | Feature6 | TOTAL_DAYS_TO_COMPLETE | Feature8 | Feature9 | Feature10 | Feature11 | Feature12 | Feature13 | Feature14 | Feature15 | Feature16 | Feature17 | Feature18 | Feature19 | Feature20 | Feature21 | Feature22 | Feature23 | Feature24 | Feature25 | Feature26 | Feature27 | Feature28 | Feature29 | Feature30 | Feature31 |
|:------------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:----------------------:|:--------:|:--------:|:---------:|:---------:|:---------:|:---------:|:---------:|:---------:|:---------:|:---------:|:---------:|:---------:|:---------:|:---------:|:---------:|:---------:|:---------:|:---------:|:---------:|:---------:|:---------:|:---------:|:---------:|:---------:|
| 102203591 | 12 | 2014 | 10 | 2014 | 1 | 2015 | 760 | 50 | 83 | 5 | 6 | 12 | 18 | 31 | 8 | 0 | 1 | 0 | 1 | 16 | 131.29 | 24.3768 | 158.82 | 1.13 | 6.52 | 10 | 51 | 39 | 27 | 88 | 1084938 |
| 102231010 | 2 | 2015 | 1 | 2015 | 2 | 2015 | 706 | 35 | 34 | 2 | 1 | 4 | 3 | 3 | 3 | 0 | 0 | 0 | 1 | 2 | 11.95 | 5.162 | 17.83 | 1.14 | 3.45 | 1 | 4 | 20 | 16 | 25 | 367140 |
| 102251893 | 6 | 2015 | 4 | 2015 | 3 | 2015 | 1143 | 36 | 43 | 1 | 2 | 4 | 5 | 6 | 3 | 1 | 0 | 0 | 1 | 5 | 8.55 | 5.653 | 34.51 | 4.59 | 6.1 | 0 | 1 | 17 | 30 | 12 | 103906 |
| 102287793 | 4 | 2015 | 2 | 2015 | 4 | 2015 | 733 | 45 | 71 | 4 | 1 | 6 | 35 | 727 | 6 | 0 | 3 | 15 | 0 | 19 | 174.69 | 97.448 | 319.98 | 1.49 | 3.28 | 20 | 113 | 71 | 59 | 71 | 1005041 |
| 102288060 | 6 | 2015 | 5 | 2015 | 4 | 2015 | 1092 | 26 | 21 | 1 | 1 | 3 | 2 | 2 | 1 | 0 | 0 | 0 | 0 | 2 | 4.73 | 4.5363 | 18.85 | 3.11 | 4.16 | 0 | 1 | 16 | 8 | 16 | 69062 |
| 102308069 | 8 | 2015 | 6 | 2015 | 5 | 2015 | 676 | 41 | 34 | 2 | 0 | 3 | 2 | 2 | 1 | 0 | 0 | 0 | 0 | 2 | 2.98 | 6.1173 | 11.3 | 1.36 | 1.85 | 0 | 1 | 17 | 12 | 3 | 145887 |
| 102319918 | 8 | 2015 | 7 | 2015 | 6 | 2015 | 884 | 25 | 37 | 1 | 1 | 3 | 2 | 3 | 2 | 0 | 0 | 1 | 0 | 2 | 5.57 | 3.7083 | 9.18 | 0.97 | 2.48 | 0 | 1 | 14 | 5 | 7 | 45243 |
| 102327578 | 6 | 2015 | 4 | 2015 | 6 | 2015 | 595 | 49 | 68 | 3 | 5 | 9 | 11 | 13 | 5 | 4 | 2 | 0 | 1 | 10 | 55.41 | 24.3768 | 104.98 | 2.03 | 4.31 | 10 | 51 | 39 | 26 | 40 | 418266 |
| 102337989 | 7 | 2015 | 5 | 2015 | 7 | 2015 | 799 | 50 | 66 | 5 | 6 | 12 | 21 | 29 | 12 | 0 | 0 | 0 | 1 | 20 | 138.79 | 24.3768 | 172.56 | 1.39 | 7.08 | 10 | 51 | 39 | 34 | 101 | 1229299 |
| 102450069 | 8 | 2015 | 7 | 2015 | 11 | 2015 | 456 | 20 | 120 | 2 | 1 | 3 | 12 | 14 | 8 | 0 | 0 | 0 | 0 | 7 | 2.92 | 6.561 | 12.3 | 1.43 | 1.87 | 2 | 1 | 15 | 6 | 6 | 142805 |
| 102514564 | 5 | 2016 | 3 | 2016 | 2 | 2016 | 639 | 25 | 35 | 1 | 2 | 4 | 3 | 6 | 3 | 0 | 0 | 0 | 0 | 3 | 4.83 | 4.648 | 14.22 | 2.02 | 3.06 | 0 | 1 | 15 | 5 | 13 | 62941 |
| 102528121 | 10 | 2015 | 9 | 2015 | 3 | 2016 | 413 | 15 | 166 | 1 | 1 | 3 | 2 | 3 | 2 | 0 | 0 | 0 | 0 | 2 | 4.23 | 1.333 | 15.78 | 8.66 | 11.84 | 1 | 4 | 8 | 6 | 3 | 111752 |
| 102564376 | 1 | 2016 | 12 | 2015 | 4 | 2016 | 802 | 27 | 123 | 2 | 1 | 4 | 3 | 3 | 3 | 0 | 1 | 0 | 0 | 3 | 1.27 | 2.063 | 6.9 | 2.73 | 3.34 | 1 | 4 | 14 | 20 | 6 | 132403 |
| 102564472 | 1 | 2016 | 12 | 2015 | 4 | 2016 | 817 | 27 | 123 | 0 | 1 | 2 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1.03 | 2.063 | 9.86 | 4.28 | 4.78 | 1 | 4 | 14 | 22 | 4 | 116907 |
| 102599569 | 2 | 2016 | 12 | 2015 | 5 | 2016 | 425 | 47 | 151 | 1 | 2 | 4 | 3 | 4 | 3 | 0 | 0 | 0 | 0 | 2 | 27.73 | 15.8993 | 60.5 | 2.06 | 3.81 | 12 | 108 | 34 | 24 | 20 | 119743 |
| 102599628 | 2 | 2016 | 12 | 2015 | 5 | 2016 | 425 | 47 | 151 | 3 | 4 | 8 | 8 | 9 | 7 | 0 | 0 | 0 | 2 | 8 | 39.28 | 14.8593 | 91.26 | 3.5 | 6.14 | 12 | 108 | 34 | 38 | 15 | 173001 |
| 102606421 | 3 | 2016 | 12 | 2015 | 5 | 2016 | 965 | 55 | 161 | 5 | 11 | 17 | 29 | 44 | 11 | 1 | 1 | 0 | 1 | 22 | 148.06 | 23.7983 | 195.69 | 2 | 8.22 | 10 | 51 | 39 | 47 | 112 | 1196097 |
| 102621293 | 7 | 2016 | 5 | 2016 | 6 | 2016 | 701 | 42 | 27 | 2 | 1 | 4 | 3 | 3 | 1 | 0 | 0 | 0 | 1 | 2 | 8.39 | 3.7455 | 13.93 | 1.48 | 3.72 | 1 | 5 | 14 | 14 | 20 | 258629 |
| 102632364 | 7 | 2016 | 6 | 2016 | 6 | 2016 | 982 | 41 | 26 | 4 | 2 | 7 | 6 | 6 | 2 | 0 | 0 | 0 | 1 | 4 | 26.07 | 2.818 | 37.12 | 3.92 | 13.17 | 1 | 5 | 14 | 22 | 10 | 167768 |
| 102643207 | 9 | 2016 | 9 | 2016 | 7 | 2016 | 255 | 9 | 73 | 3 | 1 | 5 | 4 | 4 | 2 | 0 | 0 | 0 | 0 | 0 | 2.17 | 0.188 | 4.98 | 14.95 | 26.49 | 1 | 4 | 2 | 11 | 1 | 49070 |
| 102656091 | 9 | 2016 | 8 | 2016 | 7 | 2016 | 356 | 21 | 35 | 1 | 0 | 2 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1.45 | 2.0398 | 5.54 | 2.01 | 2.72 | 1 | 4 | 14 | 15 | 3 | 117107 |
| 102660407 | 9 | 2016 | 8 | 2016 | 7 | 2016 | 462 | 21 | 31 | 2 | 0 | 3 | 2 | 2 | 1 | 0 | 0 | 0 | 0 | 2 | 3.18 | 2.063 | 8.76 | 2.7 | 4.25 | 1 | 4 | 14 | 14 | 10 | 151272 |
| 102665666 | 10 | 2016 | 9 | 2016 | 7 | 2016 | 235 | 9 | 64 | 0 | 1 | 2 | 1 | 2 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0.188 | 2.95 | 10.37 | 15.69 | 1 | 4 | 2 | 10 | 1 | 52578 |
| 102665667 | 10 | 2016 | 9 | 2016 | 7 | 2016 | 235 | 9 | 64 | 0 | 1 | 2 | 1 | 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0.72 | 0.188 | 2.22 | 7.98 | 11.81 | 1 | 4 | 2 | 10 | 1 | 52578 |
| 102665668 | 10 | 2016 | 9 | 2016 | 7 | 2016 | 235 | 9 | 64 | 0 | 1 | 2 | 1 | 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0.9 | 0.188 | 2.24 | 7.13 | 11.91 | 1 | 4 | 2 | 10 | 1 | 52578 |
| 102666306 | 7 | 2016 | 6 | 2016 | 7 | 2016 | 235 | 16 | 34 | 3 | 1 | 5 | 5 | 6 | 4 | 0 | 0 | 0 | 0 | 3 | 14.06 | 3.3235 | 31.27 | 5.18 | 9.41 | 1 | 1 | 16 | 5 | 18 | 246030 |
| 102668177 | 8 | 2016 | 6 | 2016 | 8 | 2016 | 233 | 36 | 32 | 0 | 1 | 2 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 2.5 | 5.2043 | 8.46 | 1.15 | 1.63 | 0 | 1 | 14 | 2 | 4 | 89059 |
| 102669909 | 6 | 2016 | 4 | 2016 | 8 | 2016 | 244 | 46 | 105 | 4 | 11 | 16 | 28 | 30 | 15 | 1 | 2 | 1 | 1 | 25 | 95.49 | 26.541 | 146.89 | 1.94 | 5.53 | 1 | 51 | 33 | 9 | 48 | 78488 |
| 102670188 | 5 | 2016 | 4 | 2016 | 8 | 2016 | 413 | 20 | 109 | 1 | 1 | 2 | 2 | 3 | 2 | 0 | 0 | 0 | 0 | 1 | 2.36 | 6.338 | 8.25 | 0.93 | 1.3 | 2 | 1 | 14 | 5 | 3 | 117137 |
| 102671063 | 8 | 2016 | 6 | 2016 | 8 | 2016 | 296 | 46 | 44 | 2 | 4 | 7 | 7 | 111 | 3 | 1 | 0 | 1 | 0 | 7 | 12.96 | 98.748 | 146.24 | 1.35 | 1.48 | 20 | 113 | 70 | 26 | 9 | 430192 |
| 102672475 | 8 | 2016 | 7 | 2016 | 8 | 2016 | 217 | 20 | 23 | 0 | 1 | 2 | 1 | 2 | 1 | 0 | 0 | 0 | 0 | 1 | 0.5 | 4.9093 | 5.37 | 0.99 | 1.09 | 0 | 1 | 16 | 0 | 1 | 116673 |
| 102672477 | 10 | 2016 | 9 | 2016 | 8 | 2016 | 194 | 20 | 36 | 1 | 0 | 2 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0.61 | 5.1425 | 3.65 | 0.59 | 0.71 | 0 | 1 | 16 | 0 | 2 | 98750 |
| 102672513 | 10 | 2016 | 9 | 2016 | 8 | 2016 | 228 | 20 | 36 | 1 | 1 | 3 | 2 | 2 | 1 | 0 | 0 | 0 | 0 | 1 | 0.25 | 5.1425 | 6.48 | 1.21 | 1.26 | 0 | 1 | 16 | 0 | 2 | 116780 |
| 102682943 | 5 | 2016 | 4 | 2016 | 8 | 2016 | 417 | 20 | 113 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0.64 | 6.338 | 5.53 | 0.77 | 0.87 | 2 | 1 | 14 | 5 | 2 | 100307 |
The ORDER_NUMBER should not be a feature in the model: it is just a random unique identifier, so I want the model to essentially ignore it, but I do want to include it in the final dataset so I can tie predictions and actual values back to the order.
Currently, my code looks like this:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
import pandas as pd
import numpy as np
def get_feature_importances(cols, importances):
    feats = {}
    for feature, importance in zip(cols, importances):
        feats[feature] = importance
    importances = pd.DataFrame.from_dict(feats, orient='index').rename(columns={0: 'Gini-importance'})
    return importances.sort_values(by='Gini-importance', ascending=False)

def compare_values(arr1, arr2):
    thediff = 0
    thediffs = []
    for thing1, thing2 in zip(arr1, arr2):
        thediff = abs(thing1 - thing2)
        thediffs.append(thediff)
    return thediffs

def print_to_file(filepath, arr):
    with open(filepath, 'w') as f:
        for item in arr:
            f.write("%s\n" % item)
# READ IN THE DATA TABLE ABOVE
data = pd.read_csv('test.csv')
# create the labels, or field we are trying to estimate
label = data['TOTAL_DAYS_TO_COMPLETE']
# remove the header
label = label[1:]
# create the data, or the data that is to be estimated
data = data.drop('TOTAL_DAYS_TO_COMPLETE', axis=1)
# Remove the order number since we don't need it
data = data.drop('ORDER_NUMBER', axis=1)
# remove the header
data = data[1:]
# # split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(data, label, test_size = 0.2)
rf = RandomForestRegressor(
    bootstrap=True,
    max_depth=None,
    max_features='sqrt',
    min_samples_leaf=1,
    min_samples_split=2,
    n_estimators=5000
)
rf.fit(X_train, y_train)
rf_predictions = rf.predict(X_test)
rf_differences = compare_values(y_test, rf_predictions)
rf_Avg = np.average(rf_differences)
print("#################################################")
print("DATA FOR RANDOM FORESTS")
print(rf_Avg)
importances = get_feature_importances(X_test.columns, rf.feature_importances_)
print()
print(importances)
If I print(y_test) and print(rf_predictions) I get something like:
**print(y_test)**
7
155
84
64
49
41
200
168
43
111
64
46
96
47
50
27
216
..
**print(rf_predictions)**
34.496
77.366
69.6105
61.6825
80.8495
79.8785
177.5465
129.014
70.0405
97.3975
82.4435
57.9575
108.018
57.5515
..
And it works. If I print out y_test and rf_predictions, I get the labels for the test data and the predicted label values.
However, I would like to see which orders are associated with both the y_test values and the rf_predictions values. How can I keep that information and create a dataframe like the one below:
| Order Number | Predicted Value | Actual Value |
|--------------|-------------------|--------------|
| Foo0 | 34.496 | 7 |
| Foo1 | 77.366 | 155 |
| Foo2 | 69.6105 | 84 |
| Foo3 | 61.6825 | 64 |
I have tried looking at this post, but I could not get a solution from it. I did try print(y_test, rf_predictions), but that did not do any good since I have already dropped the ORDER_NUMBER field with .drop().
As you're using pandas dataframes, the index is retained in all your X/y train/test datasets, so you can re-assemble everything after you have applied the model. We just need to save the order numbers before dropping that column: order_numbers = data['ORDER_NUMBER']. The predictions rf_predictions are returned in the same order as the input to rf.predict(X_test), i.e. rf_predictions[i] belongs to X_test.iloc[i].
This creates your required result dataset:
res = y_test.to_frame('Actual Value')
res.insert(0, 'Predicted Value', rf_predictions)
res = order_numbers.to_frame().join(res, how='inner')
Btw, data = data[1:] doesn't remove the header; it removes the first data row, so there's no need to remove anything when you work with pandas dataframes.
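A quick check illustrates this (a small sketch, assuming test.csv as in your question):
data = pd.read_csv('test.csv')
print(data.columns.tolist())  # the header line already became the column names
print(len(data))              # the row count does not include the header line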
So the final program will be:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
import pandas as pd
import numpy as np
def get_feature_importances(cols, importances):
    feats = {}
    for feature, importance in zip(cols, importances):
        feats[feature] = importance
    importances = pd.DataFrame.from_dict(feats, orient='index').rename(columns={0: 'Gini-importance'})
    return importances.sort_values(by='Gini-importance', ascending=False)

def compare_values(arr1, arr2):
    thediff = 0
    thediffs = []
    for thing1, thing2 in zip(arr1, arr2):
        thediff = abs(thing1 - thing2)
        thediffs.append(thediff)
    return thediffs

def print_to_file(filepath, arr):
    with open(filepath, 'w') as f:
        for item in arr:
            f.write("%s\n" % item)
# READ IN THE DATA TABLE ABOVE
data = pd.read_csv('test.csv')
# create the labels, or field we are trying to estimate
label = data['TOTAL_DAYS_TO_COMPLETE']
# create the data, or the data that is to be estimated
data = data.drop('TOTAL_DAYS_TO_COMPLETE', axis=1)
# Remove the order number since we don't need it
order_numbers = data['ORDER_NUMBER']
data = data.drop('ORDER_NUMBER', axis=1)
# split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(data, label, test_size=0.2, random_state=1)
rf = RandomForestRegressor(
    bootstrap=True,
    max_depth=None,
    max_features='sqrt',
    min_samples_leaf=1,
    min_samples_split=2,
    n_estimators=5000
)
rf.fit(X_train, y_train)
rf_predictions = rf.predict(X_test)
rf_differences = compare_values(y_test, rf_predictions)
rf_Avg = np.average(rf_differences)
print("#################################################")
print("DATA FOR RANDOM FORESTS")
print(rf_Avg)
importances = get_feature_importances(X_test.columns, rf.feature_importances_)
print()
print(importances)
res = y_test.to_frame('Actual Value')
res.insert(0, 'Predicted Value', rf_predictions)
res = order_numbers.to_frame().join(res, how='inner')
print(res)
With your example data from above we get (for train_test_split with random_state=1):
ORDER_NUMBER Predicted Value Actual Value
3 102287793 652.0746 733
14 102599569 650.3984 425
19 102643207 319.4964 255
20 102656091 388.6004 356
26 102668177 475.1724 233
27 102669909 671.9158 244
32 102672513 319.1550 228
I have a table with 4 columns. From this data I obtained another 2 tables with some rolling averages computed from the original table. Now I want to combine these 3 into a final table, but the indexes are no longer in order and I can't do it. I just started to learn Python, I have zero experience, and I would really appreciate all the help I can get.
DF
+----+------------+-----------+------+------+
| | A | B | C | D |
+----+------------+-----------+------+------+
| 1 | Home Team | Away Team | Htgs | Atgs |
| 2 | dalboset | sopot | 1 | 2 |
| 3 | calnic | resita | 1 | 3 |
| 4 | sopot | dalboset | 2 | 2 |
| 5 | resita | sopot | 4 | 1 |
| 6 | sopot | dalboset | 2 | 1 |
| 7 | caransebes | dalboset | 1 | 2 |
| 8 | calnic | resita | 1 | 3 |
| 9 | dalboset | sopot | 2 | 2 |
| 10 | calnic | resita | 4 | 1 |
| 11 | sopot | dalboset | 2 | 1 |
| 12 | resita | sopot | 1 | 2 |
| 13 | sopot | dalboset | 1 | 3 |
| 14 | caransebes | dalboset | 2 | 2 |
| 15 | calnic | resita | 4 | 1 |
| 16 | dalboset | sopot | 2 | 1 |
| 17 | calnic | resita | 1 | 2 |
| 18 | sopot | dalboset | 4 | 1 |
| 19 | resita | sopot | 2 | 1 |
| 20 | sopot | dalboset | 1 | 2 |
| 21 | caransebes | dalboset | 1 | 3 |
| 22 | calnic | resita | 2 | 2 |
+----+------------+-----------+------+------+
CODE
df1 = df.groupby('Home Team')[['Htgs', 'Atgs']].rolling(window=4, min_periods=3).mean()
df1 = df1.rename(columns={'Htgs': 'Htgs/3', 'Atgs': 'Htgc/3'})
df1
df2 = df.groupby('Away Team')[['Htgs', 'Atgs']].rolling(window=4, min_periods=3).mean()
df2 = df2.rename(columns={'Htgs': 'Atgc/3', 'Atgs': 'Atgs/3'})
df2
Now I need a way to see the rolling-average columns next to the Home Team, Away Team, Htgs and Atgs columns from the original table.
Done!
I created the new column directly in the dataframe like this:
df = pd.read_csv('Fd.csv')
df['Htgs/3'] = df.groupby('Home Team')['Htgs'].rolling(window=4, min_periods=3).mean().reset_index(0, drop=True)
Htgs/3 will be the new column with the rolling average of Htgs per Home Team, and for the rest I will do the same, as sketched below.
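The remaining rolling-average columns follow the same pattern (a sketch, with the column names assumed from the tables above):
df['Htgc/3'] = df.groupby('Home Team')['Atgs'].rolling(window=4, min_periods=3).mean().reset_index(0, drop=True)
df['Atgc/3'] = df.groupby('Away Team')['Htgs'].rolling(window=4, min_periods=3).mean().reset_index(0, drop=True)
df['Atgs/3'] = df.groupby('Away Team')['Atgs'].rolling(window=4, min_periods=3).mean().reset_index(0, drop=True)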
How can I merge/join these two dataframes ONLY on "sample_id" and drop the extra rows from the second dataframe when merging/joining?
Using pandas in Python.
First dataframe (fdf)
| sample_id | name |
|-----------|-------|
| 1 | Mark |
| 1 | Dart |
| 2 | Julia |
| 2 | Oolia |
| 2 | Talia |
Second dataframe (sdf)
| sample_id | salary | time |
|-----------|--------|------|
| 1 | 20 | 0 |
| 1 | 30 | 5 |
| 1 | 40 | 10 |
| 1 | 50 | 15 |
| 2 | 33 | 0 |
| 2 | 23 | 5 |
| 2 | 24 | 10 |
| 2 | 28 | 15 |
| 2 | 29 | 20 |
So the resulting df will be like -
| sample_id | name | salary | time |
|-----------|-------|--------|------|
| 1 | Mark | 20 | 0 |
| 1 | Dart | 30 | 5 |
| 2 | Julia | 33 | 0 |
| 2 | Oolia | 23 | 5 |
| 2 | Talia | 24 | 10 |
There are duplicates, so you need a helper column for a correct DataFrame.merge, using GroupBy.cumcount as the counter:
df = (fdf.assign(g=fdf.groupby('sample_id').cumcount())
.merge(sdf.assign(g=sdf.groupby('sample_id').cumcount()), on=['sample_id', 'g'])
.drop('g', axis=1))
print (df)
sample_id name salary time
0 1 Mark 20 0
1 1 Dart 30 5
2 2 Julia 33 0
3 2 Oolia 23 5
4 2 Talia 24 10
final_res = pd.merge(fdf, sdf, on=['sample_id'], how='left')
final_res.sort_values(['sample_id', 'name', 'time'], ascending=[True, True, True], inplace=True)
final_res.drop_duplicates(subset=['sample_id', 'name'], keep='first', inplace=True)