slightly different results on scikit-learn decision trees regression - python

The two code samples below should, IMO, produce exactly the same output, but they don't, even though the results differ only marginally. The train/test split is fixed with a specified random_state, which AFAIU should guarantee reproducible results. The only difference between the two is that code #0 uses an explicit variable for the decision tree model.
Code #0
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
boston = load_boston()
y = boston.target
X = boston.data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.33, random_state=42)
DT_regressor = DecisionTreeRegressor()
DT_model = DT_regressor.fit(X_train, y_train)
y_DT_pred = DT_model.predict(X_test)
def mse(actual, preds):
    delta = np.sum((actual - preds) * (actual - preds))
    return delta / len(preds)
# Check your solution matches sklearn
print('decision trees')
print(mse(y_test, y_DT_pred))
print(mean_squared_error(y_test, y_DT_pred))
print("If the above match, you are all set!")
print('predicted')
print(y_DT_pred)
print('labels')
print(y_test)
Code #1
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
import numpy as np
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
boston = load_boston()
y = boston.target
X = boston.data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.33, random_state=42)
tree_mod = DecisionTreeRegressor()
tree_mod.fit(X_train, y_train)
preds_tree = tree_mod.predict(X_test)
def mse(actual, preds):
    return np.sum((actual - preds) ** 2) / len(actual)
# Check your solution matches sklearn
print(mse(y_test, preds_tree))
print(mean_squared_error(y_test, preds_tree))
print("If the above match, you are all set!")
print('predicted')
print(preds_tree)
print('labels')
print(y_test)
Even after changing the split to random_state=0, there are differences.
output of code#0
26.12281437125748
26.12281437125748
If the above match, you are all set!
predicted
[23.4 24.5 20.1 11.7 20.7 20.4 21.8 20.5 22.7 16.1 10.8 17.9 14.9 8.8
50. 37. 21.2 32.7 28. 18.9 23.1 22.7 23.1 24.8 19.7 10.9 19.3 13.1
37.6 18.4 12.5 17.7 24.5 23.1 23.2 17.7 8.3 19.5 12.7 17.9 22.9 19.7
23.9 12.5 22. 20.5 22.4 13.8 15.6 28.7 13.8 18.3 18.2 35.2 19. 22.4
21.7 20.7 10.9 19.5 20.6 23.1 34.9 30.1 17.7 32. 16.1 18.9 16.7 21.7
20.6 23.8 23.2 33.1 28.4 8.8 41.7 23.1 22. 21.8 27.1 19.3 20.2 37.6
37.6 25. 19.3 13.8 24.3 14.3 17.5 11.8 23.1 35.1 21.6 23.8 10.2 20.7
14.3 23.1 25. 20.1 33.8 24.5 25. 23.1 8.3 19.5 23.8 22. 23.6 17.9
18.9 18.3 20. 20. 9.5 14.5 9.5 50. 32. 6.3 14.4 21.7 25. 17.3
34.9 22.5 18.9 36.1 12.5 9.5 15.2 19.6 10.5 34.9 20. 15.6 28.6 8.3
10.9 21.8 23.6 24.4 24.2 14.5 37.3 37.3 12.8 6.3 28.4 25. 15.6 32.4
17.4 23.7 17.3 19.7 21.8 13.1 8.3 17.5 34.9 31.6 31. 23.1 23.1]
labels
[22.6 50. 23. 8.3 21.2 19.9 20.6 18.7 16.1 18.6 8.8 17.2 14.9 10.5
50. 29. 23. 33.3 29.4 21. 23.8 19.1 20.4 29.1 19.3 23.1 19.6 19.4
38.7 18.7 14.6 20. 20.5 20.1 23.6 16.8 5.6 50. 14.5 13.3 23.9 20.
19.8 13.8 16.5 21.6 20.3 17. 11.8 27.5 15.6 23.1 24.3 42.8 15.6 21.7
17.1 17.2 15. 21.7 18.6 21. 33.1 31.5 20.1 29.8 15.2 15. 27.5 22.6
20. 21.4 23.5 31.2 23.7 7.4 48.3 24.4 22.6 18.3 23.3 17.1 27.9 44.8
50. 23. 21.4 10.2 23.3 23.2 18.9 13.4 21.9 24.8 11.9 24.3 13.8 24.7
14.1 18.7 28.1 19.8 26.7 21.7 22. 22.9 10.4 21.9 20.6 26.4 41.3 17.2
27.1 20.4 16.5 24.4 8.4 23. 9.7 50. 30.5 12.3 19.4 21.2 20.3 18.8
33.4 18.5 19.6 33.2 13.1 7.5 13.6 17.4 8.4 35.4 24. 13.4 26.2 7.2
13.1 24.5 37.2 25. 24.1 16.6 32.9 36.2 11. 7.2 22.8 28.7 14.4 24.4
18.1 22.5 20.5 15.2 17.4 13.6 8.7 18.2 35.4 31.7 33. 22.2 20.4]
output of code#1
28.135568862275445
28.135568862275445
If the above match, you are all set!
predicted
[23.1 24.5 20.1 19.1 20.7 20.4 21.8 19. 21.8 16.1 10.8 17.9 14.9 8.8
50. 37. 21.2 32.7 24.5 18.9 23.1 21.5 20.1 24.8 19.7 10.9 19.3 15.6
37.6 18.8 12.5 19.1 24.5 23.1 23.9 17.7 7. 19.5 12.7 17.9 22.9 19.7
23.9 12.5 22. 20.5 22.5 13.3 15.6 28.4 13.3 18.4 18.2 21.9 18.4 22.4
21.7 20.7 10.9 19.3 19.4 23.1 35.1 30.1 19.1 32. 16.1 18.9 16.7 21.7
20.6 23.8 23.7 33.1 28.6 7.2 41.7 23.1 22. 21.7 27.1 19.2 20.2 37.6
37.6 25. 19.3 13.8 24.3 14.3 17.5 11.8 23.2 34.9 21.6 23.8 10.9 22.3
14.3 23.1 25. 20.1 30.3 24.5 21. 23.1 8.3 19.9 23.8 22. 23.6 17.9
20. 18.4 18.9 20.7 9.5 14.5 10.2 50. 32. 6.3 14.4 21.7 25. 17.4
34.9 22.5 18.9 37.3 12.7 9.5 15.2 19.6 10.8 34.9 22.2 15.6 28.6 7.
10.9 21.7 23.6 24.4 24.2 16. 37.3 37.3 12.8 8.8 28.6 25.3 14.3 32.5
17.4 23.7 17.4 19.9 21.7 12.7 7. 17.6 35.1 31.5 30.3 23.1 22.1]
labels
[22.6 50. 23. 8.3 21.2 19.9 20.6 18.7 16.1 18.6 8.8 17.2 14.9 10.5
50. 29. 23. 33.3 29.4 21. 23.8 19.1 20.4 29.1 19.3 23.1 19.6 19.4
38.7 18.7 14.6 20. 20.5 20.1 23.6 16.8 5.6 50. 14.5 13.3 23.9 20.
19.8 13.8 16.5 21.6 20.3 17. 11.8 27.5 15.6 23.1 24.3 42.8 15.6 21.7
17.1 17.2 15. 21.7 18.6 21. 33.1 31.5 20.1 29.8 15.2 15. 27.5 22.6
20. 21.4 23.5 31.2 23.7 7.4 48.3 24.4 22.6 18.3 23.3 17.1 27.9 44.8
50. 23. 21.4 10.2 23.3 23.2 18.9 13.4 21.9 24.8 11.9 24.3 13.8 24.7
14.1 18.7 28.1 19.8 26.7 21.7 22. 22.9 10.4 21.9 20.6 26.4 41.3 17.2
27.1 20.4 16.5 24.4 8.4 23. 9.7 50. 30.5 12.3 19.4 21.2 20.3 18.8
33.4 18.5 19.6 33.2 13.1 7.5 13.6 17.4 8.4 35.4 24. 13.4 26.2 7.2
13.1 24.5 37.2 25. 24.1 16.6 32.9 36.2 11. 7.2 22.8 28.7 14.4 24.4
18.1 22.5 20.5 15.2 17.4 13.6 8.7 18.2 35.4 31.7 33. 22.2 20.4]

The model itself also has a random component, so fixing just the split isn't enough. Try setting
DecisionTreeRegressor(random_state=0)
as well.
If that doesn't help, it would be useful if you posted your results.
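To see the effect, here's a minimal sketch (using synthetic data in place of the Boston set, since the exact dataset doesn't matter for reproducibility): once random_state is fixed on the regressor itself, two independently trained trees agree exactly.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data; any fixed (X, y) would do.
rng = np.random.RandomState(42)
X = rng.rand(200, 5)
y = rng.rand(200)

# With random_state fixed on the estimator, two independent fits
# build identical trees and hence identical predictions.
t1 = DecisionTreeRegressor(random_state=0).fit(X, y)
t2 = DecisionTreeRegressor(random_state=0).fit(X, y)
print(np.array_equal(t1.predict(X), t2.predict(X)))  # True
```

Without the random_state argument, ties between equally good splits can be broken differently from run to run, which is enough to produce the small discrepancies you are seeing.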


Scikit-learn regression on two variables given a 2D matrix of reference values

I have a matrix of reference values and would like to learn how Scikit-learn can be used to generate a regression model for it. I have done several types of univariate regressions in the past but it's not clear to me how to use two variables in sklearn.
I have two features (A and B) and a table of output values for certain input A/B values. See the table and 3D surface below. I'd like to see how I can translate this into a two-variable equation relating the A/B inputs to the single output value, as shown in the table. The relationship looks nonlinear; it could also be quadratic, logarithmic, etc.
How do I use sklearn to perform a nonlinear regression on this tabular data?
A/B 1000 1100 1200 1300 1400 1500 1600 1700 1800 1900 2000
0 8.78 8.21 7.64 7.07 6.50 5.92 5.35 4.78 4.21 3.63 3.06
5 8.06 7.56 7.07 6.58 6.08 5.59 5.10 4.60 4.11 3.62 3.12
10 7.33 6.91 6.50 6.09 5.67 5.26 4.84 4.43 4.01 3.60 3.19
15 6.60 6.27 5.93 5.59 5.26 4.92 4.59 4.25 3.92 3.58 3.25
20 5.87 5.62 5.36 5.10 4.85 4.59 4.33 4.08 3.82 3.57 3.31
25 5.14 4.97 4.79 4.61 4.44 4.26 4.08 3.90 3.73 3.55 3.37
30 4.42 4.32 4.22 4.12 4.02 3.93 3.83 3.73 3.63 3.53 3.43
35 3.80 3.78 3.75 3.72 3.70 3.67 3.64 3.62 3.59 3.56 3.54
40 2.86 2.93 2.99 3.05 3.12 3.18 3.24 3.31 3.37 3.43 3.50
45 2.08 2.24 2.39 2.54 2.70 2.85 3.00 3.16 3.31 3.46 3.62
50 1.64 1.84 2.05 2.26 2.46 2.67 2.88 3.08 3.29 3.50 3.70
55 1.55 1.77 1.98 2.19 2.41 2.62 2.83 3.05 3.26 3.47 3.69
60 2.09 2.22 2.35 2.48 2.61 2.74 2.87 3.00 3.13 3.26 3.39
65 3.12 3.08 3.05 3.02 2.98 2.95 2.92 2.88 2.85 2.82 2.78
70 3.50 3.39 3.28 3.17 3.06 2.95 2.84 2.73 2.62 2.51 2.40
75 3.42 3.32 3.21 3.10 3.00 2.89 2.78 2.68 2.57 2.46 2.36
80 3.68 3.55 3.43 3.31 3.18 3.06 2.94 2.81 2.69 2.57 2.44
85 3.43 3.35 3.28 3.21 3.13 3.06 2.99 2.91 2.84 2.77 2.69
90 3.43 3.35 3.28 3.21 3.13 3.06 2.99 2.91 2.84 2.77 2.69
95 3.43 3.35 3.28 3.21 3.13 3.06 2.99 2.91 2.84 2.77 2.69
100 3.43 3.35 3.28 3.21 3.13 3.06 2.99 2.91 2.84 2.77 2.69
There is probably a succinct nonlinear relationship between your A, B, and table values, but without some knowledge of this system, and without any sophisticated nonlinear modeling, here's a ridiculous model with a decent score.
the_table = """1000 1100 1200 1300 1400 1500 1600 1700 1800 1900 2000
0 8.78 8.21 7.64 7.07 6.50 5.92 5.35 4.78 4.21 3.63 3.06
5 8.06 7.56 7.07 6.58 6.08 5.59 5.10 4.60 4.11 3.62 3.12
10 7.33 6.91 6.50 6.09 5.67 5.26 4.84 4.43 4.01 3.60 3.19
15 6.60 6.27 5.93 5.59 5.26 4.92 4.59 4.25 3.92 3.58 3.25
20 5.87 5.62 5.36 5.10 4.85 4.59 4.33 4.08 3.82 3.57 3.31
25 5.14 4.97 4.79 4.61 4.44 4.26 4.08 3.90 3.73 3.55 3.37
30 4.42 4.32 4.22 4.12 4.02 3.93 3.83 3.73 3.63 3.53 3.43
35 3.80 3.78 3.75 3.72 3.70 3.67 3.64 3.62 3.59 3.56 3.54
40 2.86 2.93 2.99 3.05 3.12 3.18 3.24 3.31 3.37 3.43 3.50
45 2.08 2.24 2.39 2.54 2.70 2.85 3.00 3.16 3.31 3.46 3.62
50 1.64 1.84 2.05 2.26 2.46 2.67 2.88 3.08 3.29 3.50 3.70
55 1.55 1.77 1.98 2.19 2.41 2.62 2.83 3.05 3.26 3.47 3.69
60 2.09 2.22 2.35 2.48 2.61 2.74 2.87 3.00 3.13 3.26 3.39
65 3.12 3.08 3.05 3.02 2.98 2.95 2.92 2.88 2.85 2.82 2.78
70 3.50 3.39 3.28 3.17 3.06 2.95 2.84 2.73 2.62 2.51 2.40
75 3.42 3.32 3.21 3.10 3.00 2.89 2.78 2.68 2.57 2.46 2.36
80 3.68 3.55 3.43 3.31 3.18 3.06 2.94 2.81 2.69 2.57 2.44
85 3.43 3.35 3.28 3.21 3.13 3.06 2.99 2.91 2.84 2.77 2.69
90 3.43 3.35 3.28 3.21 3.13 3.06 2.99 2.91 2.84 2.77 2.69
95 3.43 3.35 3.28 3.21 3.13 3.06 2.99 2.91 2.84 2.77 2.69
100 3.43 3.35 3.28 3.21 3.13 3.06 2.99 2.91 2.84 2.77 2.69"""
import numpy as np
import pandas as pd
import io
df = pd.read_csv(io.StringIO(the_table), sep=r"\s+")
df.columns = df.columns.astype(np.uint64)
df_unstacked = df.unstack()
X = df_unstacked.index.tolist()
y = df_unstacked.to_list()
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(6)
poly_X = poly.fit_transform(X)
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(poly_X,y)
print(model.score(poly_X,y))
# 0.9762180339233807
To predict values for a given A and B using this model, you need to transform the input the same way it was transformed when creating the model. So, for example,
model.predict(poly.transform([(1534,56)]))
# array([2.75275659])
Even more outrageously ridiculous ...
more_X = [(a,b,np.log1p(a),np.log1p(b),np.cos(np.pi*b/100)) for a,b in X]
poly = PolynomialFeatures(5)
poly_X = poly.fit_transform(more_X)
model.fit(poly_X,y)
print(model.score(poly_X,y))
# 0.9982994398684035
... and to predict:
more_X = [(a,b,np.log1p(a),np.log1p(b),np.cos(np.pi*b/100)) for a,b in [(1534,56)]]
model.predict(poly.transform(more_X))
# array([2.74577017])
N.B.: There are probably more Pythonic ways to program these ridiculous models.
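One such tidier way is sklearn's make_pipeline, which bundles the polynomial expansion and the regression into a single estimator so there is no separate poly.transform() step to forget at prediction time. A sketch, using a hypothetical quadratic surface in place of the real table:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Hypothetical stand-in for the (A, B) grid: an exactly quadratic surface.
A, B = np.meshgrid(np.arange(0, 101, 5), np.arange(1000, 2001, 100))
X = np.column_stack([A.ravel(), B.ravel()])
y = 0.001 * A.ravel() ** 2 - 0.000002 * A.ravel() * B.ravel() + 3.0

# The pipeline applies the polynomial expansion automatically in both
# fit() and predict(), so callers only ever pass raw (A, B) pairs.
model = make_pipeline(PolynomialFeatures(2), LinearRegression())
model.fit(X, y)
print(model.score(X, y))  # ~1.0, since the surface is exactly degree 2

model.predict([[56, 1534]])  # raw inputs; no manual transform needed
```

The same idea applies directly to the degree-6 model above: replace PolynomialFeatures(2) with PolynomialFeatures(6) and fit on the real table data.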

Creating a BMI table

I'm trying to create a BMI table with a column for height from 58 to 76 inches in 2-inch increments and a row for weight from 100 to 250 pounds in 10-pound increments. I've got the row and the column, but I can't figure out how to calculate the BMI values within the table.
This is my code:
header = '\t{}'.format('\t'.join(map(str, range(100, 260, 10))))
rows = []
for i in range(58, 78, 2):
    row = '\t'.join(map(str, (bmi for q in range(1, 17))))
    rows.append('{}\t{}'.format(i, row))
print(header + '\n' + '\n'.join(rows))
This is the output:
100 110 120 130 140 150 160 170 180 190 200 210 220 230 240 250
58
60
62
64
66
68
70
72
74
76
What I'm trying to do is fill in the chart. For example, a height of 58 inches and 100 pounds is a BMI of 22.4. A height of 58 inches and 110 pounds is 24.7, and so on.
I'm not sure how you got your expected results of 22.4 and 24.7, but if you define BMI to be weight [lb] / (height [in])^2 * 703, you could do something like the following:
In [16]: weights = range(100, 260, 10)
    ...: header = '\t' + '\t'.join(map(str, weights))
    ...: rows = [header]
    ...: for height in range(58, 78, 2):
    ...:     row = '\t'.join(f'{weight/height**2*703:.1f}' for weight in weights)
    ...:     rows.append(f'{height}\t{row}')
    ...: print('\n'.join(rows))
    ...:
100 110 120 130 140 150 160 170 180 190 200 210 220 230 240 250
58 20.9 23.0 25.1 27.2 29.3 31.3 33.4 35.5 37.6 39.7 41.8 43.9 46.0 48.1 50.2 52.2
60 19.5 21.5 23.4 25.4 27.3 29.3 31.2 33.2 35.1 37.1 39.1 41.0 43.0 44.9 46.9 48.8
62 18.3 20.1 21.9 23.8 25.6 27.4 29.3 31.1 32.9 34.7 36.6 38.4 40.2 42.1 43.9 45.7
64 17.2 18.9 20.6 22.3 24.0 25.7 27.5 29.2 30.9 32.6 34.3 36.0 37.8 39.5 41.2 42.9
66 16.1 17.8 19.4 21.0 22.6 24.2 25.8 27.4 29.0 30.7 32.3 33.9 35.5 37.1 38.7 40.3
68 15.2 16.7 18.2 19.8 21.3 22.8 24.3 25.8 27.4 28.9 30.4 31.9 33.4 35.0 36.5 38.0
70 14.3 15.8 17.2 18.7 20.1 21.5 23.0 24.4 25.8 27.3 28.7 30.1 31.6 33.0 34.4 35.9
72 13.6 14.9 16.3 17.6 19.0 20.3 21.7 23.1 24.4 25.8 27.1 28.5 29.8 31.2 32.5 33.9
74 12.8 14.1 15.4 16.7 18.0 19.3 20.5 21.8 23.1 24.4 25.7 27.0 28.2 29.5 30.8 32.1
76 12.2 13.4 14.6 15.8 17.0 18.3 19.5 20.7 21.9 23.1 24.3 25.6 26.8 28.0 29.2 30.4
What's probably tripping you up in your own code is the for q in range(1, 17), which you'll want to turn into your weights instead; you could just replace it with for q in range(100, 260, 10) and use the formula directly if you liked, but here we avoid the duplication by introducing weights.
First of all, you should remove the indentation on the print statement at the end if it is inside the loop; with the print indented, a table is printed as each row is added. Secondly, the snippet of code you will want to change is
(bmi for q in range(1, 17))
Since BMI is a function of mass and height, I would rename your iterator i to height and q to mass, and change range(1, 17) to range(100, 260, 10), to improve readability. Then replace bmi with an expression in mass and height that returns the BMI. For example,
(mass*height for mass in range(100, 260, 10))
I don't believe BMI = mass*height, but replace this with the real formula.
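Putting both suggestions together, here's a sketch of the filled-in version, assuming the standard US-units formula BMI = weight / height² × 703 (the same one used in the other answer):

```python
# BMI from weight in pounds and height in inches (standard 703 factor).
def bmi(mass, height):
    return mass / height ** 2 * 703

header = '\t{}'.format('\t'.join(map(str, range(100, 260, 10))))
rows = []
for height in range(58, 78, 2):
    # One formatted BMI value per weight column.
    row = '\t'.join('{:.1f}'.format(bmi(mass, height))
                    for mass in range(100, 260, 10))
    rows.append('{}\t{}'.format(height, row))
print(header + '\n' + '\n'.join(rows))
```

For example, bmi(100, 58) comes out to about 20.9, matching the top-left cell of the table in the other answer.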

Dropping NaNs from selected data in pandas

Continuing from my previous question (link; things are explained there), I have now obtained an array. However, I don't yet know how to use this array; that is a further question. The point of this question: there are NaN values in the 63 x 2 column that I created, and I want the rows with NaN values deleted so that I can use the data (once I ask another question about how to graph and export it as x, y arrays).
Here's what I have. This code works.
import pandas as pd
df = pd.read_csv("~/Truncated raw data hcl.csv")
data1 = [df.iloc[:, [0, 1]]]
The sample of the .csv file is located in the link.
I tried inputting
data1.dropna()
but it didn't work.
I want the NaN values/rows to drop so that I'm left with a 28 x 2 array. (I am using the first column with actual values as an example).
Thank you.
Try
import pandas as pd
df = pd.read_csv("~/Truncated raw data hcl.csv")
data1 = df.iloc[:, [0, 1]]
cleaned_data = data1.dropna()
You were probably getting an exception like "AttributeError: 'list' object has no attribute 'dropna'". That's because your data1 was not a pandas DataFrame but a list, and inside that list was a DataFrame.
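A minimal sketch of the distinction, with a hypothetical two-column frame standing in for the CSV:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one NaN row, standing in for the real CSV.
df = pd.DataFrame({'time': [0.0, 0.5, 1.0],
                   '1mnaoh trial 1': [23.2, np.nan, 23.2]})

data1 = [df.iloc[:, [0, 1]]]      # brackets wrap the DataFrame in a list
print(type(data1).__name__)       # list -> no .dropna() method here

data2 = df.iloc[:, [0, 1]]        # no brackets: a real DataFrame
print(len(data2.dropna()))        # 2 -> the NaN row is gone
```

Dropping the outer brackets is the whole fix; everything after that works as expected.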
The answer is already given, though I would like to add some thoughts on this.
Importing your DataFrame, taking the example dataset you provided in your earlier post:
>>> import pandas as pd
>>> df = pd.read_csv("so.csv")
>>> df
time 1mnaoh trial 1 1mnaoh trial 2 1mnaoh trial 3 ... 5mnaoh trial 1 5mnaoh trial 2 5mnaoh trial 3 5mnaoh trial 4
0 0.0 23.2 23.1 23.1 ... 23.3 24.3 24.1 24.1
1 0.5 23.2 23.1 23.1 ... 23.4 24.3 24.1 24.1
2 1.0 23.2 23.1 23.1 ... 23.5 24.3 24.1 24.1
3 1.5 23.2 23.1 23.1 ... 23.6 24.3 24.1 24.1
4 2.0 23.3 23.2 23.2 ... 23.7 24.5 24.7 25.1
5 2.5 24.0 23.5 23.5 ... 23.8 27.2 26.7 28.1
6 3.0 25.4 24.4 24.1 ... 23.9 31.4 29.8 31.3
7 3.5 26.9 25.5 25.1 ... 23.9 35.1 33.2 34.4
8 4.0 27.8 26.5 26.2 ... 24.0 37.7 35.9 36.8
9 4.5 28.5 27.3 27.0 ... 24.0 39.7 38.0 38.7
10 5.0 28.9 27.9 27.7 ... 24.0 40.9 39.6 40.2
11 5.5 29.2 28.2 28.3 ... 24.0 41.9 40.7 41.0
12 6.0 29.4 28.5 28.6 ... 24.1 42.5 41.6 41.2
13 6.5 29.5 28.8 28.9 ... 24.1 43.1 42.3 41.7
14 7.0 29.6 29.0 29.1 ... 24.1 43.4 42.8 42.3
15 7.5 29.7 29.2 29.2 ... 24.0 43.7 43.1 42.9
16 8.0 29.8 29.3 29.3 ... 24.2 43.8 43.3 43.3
17 8.5 29.8 29.4 29.4 ... 27.0 43.9 43.5 43.6
18 9.0 29.9 29.5 29.5 ... 30.8 44.0 43.6 43.8
19 9.5 29.9 29.6 29.5 ... 33.9 44.0 43.7 44.0
20 10.0 30.0 29.7 29.6 ... 36.2 44.0 43.7 44.1
21 10.5 30.0 29.7 29.6 ... 37.9 44.0 43.8 44.2
22 11.0 30.0 29.7 29.6 ... 39.3 NaN 43.8 44.3
23 11.5 30.0 29.8 29.7 ... 40.2 NaN 43.8 44.3
24 12.0 30.0 29.8 29.7 ... 40.9 NaN 43.9 44.3
25 12.5 30.1 29.8 29.7 ... 41.4 NaN 43.9 44.3
26 13.0 30.1 29.8 29.8 ... 41.8 NaN 43.9 44.4
27 13.5 30.1 29.9 29.8 ... 42.0 NaN 43.9 44.4
28 14.0 30.1 29.9 29.8 ... 42.1 NaN NaN 44.4
29 14.5 NaN 29.9 29.8 ... 42.3 NaN NaN 44.4
30 15.0 NaN 29.9 NaN ... 42.4 NaN NaN NaN
31 15.5 NaN NaN NaN ... 42.4 NaN NaN NaN
However, it is good to clean the data beforehand and then process it as desired, so dropping the NA values during the import itself is significantly more useful.
>>> df = pd.read_csv("so.csv").dropna() <-- dropping the NA here itself
>>> df
time 1mnaoh trial 1 1mnaoh trial 2 1mnaoh trial 3 ... 5mnaoh trial 1 5mnaoh trial 2 5mnaoh trial 3 5mnaoh trial 4
0 0.0 23.2 23.1 23.1 ... 23.3 24.3 24.1 24.1
1 0.5 23.2 23.1 23.1 ... 23.4 24.3 24.1 24.1
2 1.0 23.2 23.1 23.1 ... 23.5 24.3 24.1 24.1
3 1.5 23.2 23.1 23.1 ... 23.6 24.3 24.1 24.1
4 2.0 23.3 23.2 23.2 ... 23.7 24.5 24.7 25.1
5 2.5 24.0 23.5 23.5 ... 23.8 27.2 26.7 28.1
6 3.0 25.4 24.4 24.1 ... 23.9 31.4 29.8 31.3
7 3.5 26.9 25.5 25.1 ... 23.9 35.1 33.2 34.4
8 4.0 27.8 26.5 26.2 ... 24.0 37.7 35.9 36.8
9 4.5 28.5 27.3 27.0 ... 24.0 39.7 38.0 38.7
10 5.0 28.9 27.9 27.7 ... 24.0 40.9 39.6 40.2
11 5.5 29.2 28.2 28.3 ... 24.0 41.9 40.7 41.0
12 6.0 29.4 28.5 28.6 ... 24.1 42.5 41.6 41.2
13 6.5 29.5 28.8 28.9 ... 24.1 43.1 42.3 41.7
14 7.0 29.6 29.0 29.1 ... 24.1 43.4 42.8 42.3
15 7.5 29.7 29.2 29.2 ... 24.0 43.7 43.1 42.9
16 8.0 29.8 29.3 29.3 ... 24.2 43.8 43.3 43.3
17 8.5 29.8 29.4 29.4 ... 27.0 43.9 43.5 43.6
18 9.0 29.9 29.5 29.5 ... 30.8 44.0 43.6 43.8
19 9.5 29.9 29.6 29.5 ... 33.9 44.0 43.7 44.0
20 10.0 30.0 29.7 29.6 ... 36.2 44.0 43.7 44.1
21 10.5 30.0 29.7 29.6 ... 37.9 44.0 43.8 44.2
and lastly slice your DataFrame as you wish:
>>> df = [df.iloc[:, [0, 1]]]
# new_df = [df.iloc[:, [0, 1]]] <-- if you don't want to alter actual dataFrame
>>> df
[ time 1mnaoh trial 1
0 0.0 23.2
1 0.5 23.2
2 1.0 23.2
3 1.5 23.2
4 2.0 23.3
5 2.5 24.0
6 3.0 25.4
7 3.5 26.9
8 4.0 27.8
9 4.5 28.5
10 5.0 28.9
11 5.5 29.2
12 6.0 29.4
13 6.5 29.5
14 7.0 29.6
15 7.5 29.7
16 8.0 29.8
17 8.5 29.8
18 9.0 29.9
19 9.5 29.9
20 10.0 30.0
21 10.5 30.0]
Better solution:
Looking at the end result, you only care about the two columns 'time' and '1mnaoh trial 1', so the ideal approach is the usecols option, which reduces your memory footprint because you load only the columns that are useful to you, and then dropna(), which gives you what you want.
>>> df = pd.read_csv("so.csv", usecols=['time', '1mnaoh trial 1']).dropna()
>>> df
time 1mnaoh trial 1
0 0.0 23.2
1 0.5 23.2
2 1.0 23.2
3 1.5 23.2
4 2.0 23.3
5 2.5 24.0
6 3.0 25.4
7 3.5 26.9
8 4.0 27.8
9 4.5 28.5
10 5.0 28.9
11 5.5 29.2
12 6.0 29.4
13 6.5 29.5
14 7.0 29.6
15 7.5 29.7
16 8.0 29.8
17 8.5 29.8
18 9.0 29.9
19 9.5 29.9
20 10.0 30.0
21 10.5 30.0
22 11.0 30.0
23 11.5 30.0
24 12.0 30.0
25 12.5 30.1
26 13.0 30.1
27 13.5 30.1
28 14.0 30.1

Error in function to return 3 largest values from a list of numbers

I have this data file and I have to find the 3 largest numbers it contains
24.7 25.7 30.6 47.5 62.9 68.5 73.7 67.9 61.1 48.5 39.6 20.0
16.1 19.1 24.2 45.4 61.3 66.5 72.1 68.4 60.2 50.9 37.4 31.1
10.4 21.6 37.4 44.7 53.2 68.0 73.7 68.2 60.7 50.2 37.2 24.6
21.5 14.7 35.0 48.3 54.0 68.2 69.6 65.7 60.8 49.1 33.2 26.0
19.1 20.6 40.2 50.0 55.3 67.7 70.7 70.3 60.6 50.7 35.8 20.7
14.0 24.1 29.4 46.6 58.6 62.2 72.1 71.7 61.9 47.6 34.2 20.4
8.4 19.0 31.4 48.7 61.6 68.1 72.2 70.6 62.5 52.7 36.7 23.8
11.2 20.0 29.6 47.7 55.8 73.2 68.0 67.1 64.9 57.1 37.6 27.7
13.4 17.2 30.8 43.7 62.3 66.4 70.2 71.6 62.1 46.0 32.7 17.3
22.5 25.7 42.3 45.2 55.5 68.9 72.3 72.3 62.5 55.6 38.0 20.4
17.6 20.5 34.2 49.2 54.8 63.8 74.0 67.1 57.7 50.8 36.8 25.5
20.4 19.6 24.6 41.3 61.8 68.5 72.0 71.1 57.3 52.5 40.6 26.2
Therefore I have written the following code, but it only searches the first row of numbers instead of the entire list. Can anyone help me find the error?
def three_highest_temps(f):
    file = open(f, "r")
    largest = 0
    second_largest = 0
    third_largest = 0
    temp = []
    for line in file:
        temps = line.split()
        for i in temps:
            if i > largest:
                largest = i
            elif largest > i > second_largest:
                second_largest = i
            elif second_largest > i > third_largest:
                third_largest = i
        return largest, second_largest, third_largest

print(three_highest_temps("data5.txt"))
Your data contains floats, not integers.
You can use sorted:
>>> data = '''24.7 25.7 30.6 47.5 62.9 68.5 73.7 67.9 61.1 48.5 39.6 20.0
... 16.1 19.1 24.2 45.4 61.3 66.5 72.1 68.4 60.2 50.9 37.4 31.1
... 10.4 21.6 37.4 44.7 53.2 68.0 73.7 68.2 60.7 50.2 37.2 24.6
... 21.5 14.7 35.0 48.3 54.0 68.2 69.6 65.7 60.8 49.1 33.2 26.0
... 19.1 20.6 40.2 50.0 55.3 67.7 70.7 70.3 60.6 50.7 35.8 20.7
... 14.0 24.1 29.4 46.6 58.6 62.2 72.1 71.7 61.9 47.6 34.2 20.4
... 8.4 19.0 31.4 48.7 61.6 68.1 72.2 70.6 62.5 52.7 36.7 23.8
... 11.2 20.0 29.6 47.7 55.8 73.2 68.0 67.1 64.9 57.1 37.6 27.7
... 13.4 17.2 30.8 43.7 62.3 66.4 70.2 71.6 62.1 46.0 32.7 17.3
... 22.5 25.7 42.3 45.2 55.5 68.9 72.3 72.3 62.5 55.6 38.0 20.4
... 17.6 20.5 34.2 49.2 54.8 63.8 74.0 67.1 57.7 50.8 36.8 25.5
... 20.4 19.6 24.6 41.3 61.8 68.5 72.0 71.1 57.3 52.5 40.6 26.2
... '''
>>> sorted(map(float, data.split()), reverse=True)[:3]
[74.0, 73.7, 73.7]
If you want integer results:
>>> temps = sorted(map(float, data.split()), reverse=True)[:3]
>>> list(map(int, temps))
[74, 73, 73]
You only get the max elements for the first line because you return at the end of the first iteration. You should de-indent the return statement.
Sorting the data and picking the first 3 elements runs in n*log(n).
data = [float(v) for line in file for v in line.split()]
sorted(data, reverse=True)[:3]
It is perfectly fine for 144 elements.
You can also get the answer in linear time using a heapq
import heapq
heapq.nlargest(3, data)
Your return statement is inside the for loop. Once return is reached, the function terminates, so the loop never gets into a second iteration. Move the return outside the loop by reducing indentation.
for line in file:
    temps = line.split()
    for i in temps:
        if i > largest:
            largest = i
        elif largest > i > second_largest:
            second_largest = i
        elif second_largest > i > third_largest:
            third_largest = i
return largest, second_largest, third_largest
In addition, your comparisons won't work, because line.split() returns a list of strings, not floats. (As has been pointed out, your data consists of floats, not ints. I'm assuming the task is to find the largest float.) So let's convert the strings using float()
Your code still won't be correct, though, because when you find a new largest value, you completely discard the old one. Instead you should now consider it the second largest known value. Same rule applies for second to third largest.
for line in file:
    temps = line.split()
    for temp_string in temps:
        i = float(temp_string)
        if i > largest:
            third_largest = second_largest
            second_largest = largest
            largest = i
        elif largest > i > second_largest:
            third_largest = second_largest
            second_largest = i
        elif second_largest > i > third_largest:
            third_largest = i
return largest, second_largest, third_largest
Now there is one last issue:
You overlook cases where i is identical with one of the largest values. In such a case i > largest would be false, but so would largest > i. You could change either of these comparisons to >= to fix this.
Instead, let us simplify the if clauses by considering that the elif conditions are only considered after all previous conditions were already found to be false. When we reach the first elif, we already know that i can not be larger than largest, so it suffices to compare it to second largest. The same goes for the second elif.
for line in file:
    temps = line.split()
    for temp_string in temps:
        i = float(temp_string)
        if i > largest:
            third_largest = second_largest
            second_largest = largest
            largest = i
        elif i > second_largest:
            third_largest = second_largest
            second_largest = i
        elif i > third_largest:
            third_largest = i
return largest, second_largest, third_largest
This way we avoid accidentally filtering out the i == largest and i == second_largest edge cases.
Since you are dealing with a file, as a fast and numpythonic approach you can load the file as an array, sort the array, and take the last 3 items:
import numpy as np

with open('filename') as f:
    array = np.genfromtxt(f).ravel()
array.sort()
print(array[-3:])
[ 73.7 73.7 74. ]

Python: How to find a specific value in a specific line?

I asked about this yesterday, and someone gave me a great answer.
But I need to ask one more question.
[
Average monthly temperatures in Dubuque, Iowa,
January 1964 through december 1975, n=144
24.7 25.7 30.6 47.5 62.9 68.5 73.7 67.9 61.1 48.5 39.6 20.0
16.1 19.1 24.2 45.4 61.3 66.5 72.1 68.4 60.2 50.9 37.4 31.1
10.4 21.6 37.4 44.7 53.2 68.0 73.7 68.2 60.7 50.2 37.2 24.6
21.5 14.7 35.0 48.3 54.0 68.2 69.6 65.7 60.8 49.1 33.2 26.0
19.1 20.6 40.2 50.0 55.3 67.7 70.7 70.3 60.6 50.7 35.8 20.7
14.0 24.1 29.4 46.6 58.6 62.2 72.1 71.7 61.9 47.6 34.2 20.4
8.4 19.0 31.4 48.7 61.6 68.1 72.2 70.6 62.5 52.7 36.7 23.8
11.2 20.0 29.6 47.7 55.8 73.2 68.0 67.1 64.9 57.1 37.6 27.7
13.4 17.2 30.8 43.7 62.3 66.4 70.2 71.6 62.1 46.0 32.7 17.3
22.5 25.7 42.3 45.2 55.5 68.9 72.3 72.3 62.5 55.6 38.0 20.4
17.6 20.5 34.2 49.2 54.8 63.8 74.0 67.1 57.7 50.8 36.8 25.5
20.4 19.6 24.6 41.3 61.8 68.5 72.0 71.1 57.3 52.5 40.6 26.2
]
That's what I got from the website, and I used this:
for line in mystr.split('\n'):
    if not line:
        continue
    print(line.split()[3])
When I use this, I get every fourth value in every line.
That's almost what I want, but when I print it, I also get "in" and "december".
How can I get rid of these two words?
Skip the first two lines.
text = iter(mystr.split('\n'))
next(text)
next(text)
for line in text:
    ...

import itertools
for line in itertools.islice(mystr.split('\n'), 2, None):
    ...
Converting something that should be a float but isn't raises a ValueError exception; try the following:
for line in mystr.split('\n'):
    if not line:
        continue
    try:
        print(float(line.split()[3]))
    except ValueError:
        pass
Replace print(line.split()[3]) with:
if line.split()[3] not in ['in', 'december']:
    print(line.split()[3])
or, more generically:
value = line.split()[3]
try:
    value = float(value)
    print(value)
except ValueError:
    pass
It is good to use a generator in such a case, where you can use try: ... except: .... My take would be:
txt = """[
Average monthly temperatures in Dubuque, Iowa,
January 1964 through december 1975, n=144
24.7 25.7 30.6 47.5 62.9 68.5
16.1 19.1 24.2 45.4 61.3 66.5
10.4 21.6 37.4 44.7 53.2 68.0"""
def my_numbers(txt):
    for line in txt.splitlines():
        try:
            yield float(line.split()[3])
        except (ValueError, IndexError):
            # if conversion fails or there are not enough tokens in the line
            continue

result = list(my_numbers(txt))
print(result)  # output: [47.5, 45.4, 44.7]
