I asked about this yesterday, and someone gave me a great answer.
But I need to ask one more question.
[
Average monthly temperatures in Dubuque, Iowa,
January 1964 through december 1975, n=144
24.7 25.7 30.6 47.5 62.9 68.5 73.7 67.9 61.1 48.5 39.6 20.0
16.1 19.1 24.2 45.4 61.3 66.5 72.1 68.4 60.2 50.9 37.4 31.1
10.4 21.6 37.4 44.7 53.2 68.0 73.7 68.2 60.7 50.2 37.2 24.6
21.5 14.7 35.0 48.3 54.0 68.2 69.6 65.7 60.8 49.1 33.2 26.0
19.1 20.6 40.2 50.0 55.3 67.7 70.7 70.3 60.6 50.7 35.8 20.7
14.0 24.1 29.4 46.6 58.6 62.2 72.1 71.7 61.9 47.6 34.2 20.4
8.4 19.0 31.4 48.7 61.6 68.1 72.2 70.6 62.5 52.7 36.7 23.8
11.2 20.0 29.6 47.7 55.8 73.2 68.0 67.1 64.9 57.1 37.6 27.7
13.4 17.2 30.8 43.7 62.3 66.4 70.2 71.6 62.1 46.0 32.7 17.3
22.5 25.7 42.3 45.2 55.5 68.9 72.3 72.3 62.5 55.6 38.0 20.4
17.6 20.5 34.2 49.2 54.8 63.8 74.0 67.1 57.7 50.8 36.8 25.5
20.4 19.6 24.6 41.3 61.8 68.5 72.0 71.1 57.3 52.5 40.6 26.2
]
That's what I got from the website, and I used this:
for line in mystr.split('\n'):
    if not line:
        continue
    print(line.split()[3])
When I use this, I get the fourth value in every line.
That's almost what I want, but when I print it, I also get "in" and "december".
How can I get rid of these two words?
Skip the first two lines:
text = iter(mystr.split('\n'))
next(text)
next(text)
for line in text:
    ...
Or, using itertools.islice:
import itertools
for line in itertools.islice(mystr.split('\n'), 2, None):
    ...
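As a minimal, self-contained sketch of the islice variant: the shortened mystr below is made-up sample text that starts with exactly two header lines, which is an assumption about the real string. If the real string also begins with a "[" line, the start index would need to be 3 instead of 2.

```python
import itertools

# Hypothetical sample text: two header lines followed by data rows.
mystr = """Average monthly temperatures in Dubuque, Iowa,
January 1964 through december 1975, n=144
24.7 25.7 30.6 47.5 62.9 68.5
16.1 19.1 24.2 45.4 61.3 66.5"""

fourth_values = []
# islice(iterable, 2, None) skips the first two items and yields the rest.
for line in itertools.islice(mystr.split('\n'), 2, None):
    if not line:
        continue
    fourth_values.append(float(line.split()[3]))

print(fourth_values)
```

Collecting the values in a list rather than printing them directly makes it easier to reuse them later.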
Converting something that should be a float but isn't raises a ValueError, so try the following:
for line in mystr.split('\n'):
    if not line:
        continue
    try:
        print(float(line.split()[3]))
    except ValueError:
        pass
Replace print (line.split()[3]) with:
if line.split()[3] not in ['in', 'december']:
    print(line.split()[3])
or, more generic:
value = line.split()[3]
try:
    value = float(value)
    print(value)
except ValueError:
    pass
It is good to use a generator in such a case, since you can handle bad lines with try: ... except: .... My take would be:
txt = """[
Average monthly temperatures in Dubuque, Iowa,
January 1964 through december 1975, n=144
24.7 25.7 30.6 47.5 62.9 68.5
16.1 19.1 24.2 45.4 61.3 66.5
10.4 21.6 37.4 44.7 53.2 68.0"""
def my_numbers(txt):
    for line in txt.splitlines():
        try:
            yield float(line.split()[3])
        except (ValueError, IndexError):
            # if conversion fails or not enough tokens in line
            continue
result = list(my_numbers(txt))
print(result)  # output: [47.5, 45.4, 44.7]
The two scripts below should, in my opinion, deliver exactly the same output, but they don't, although the results differ only marginally. The train/test split is fixed with a specified random_state, which as far as I understand should guarantee reproducible results. The only difference in the code is that code #0 uses an explicit variable for the fitted decision tree model.
Code #0
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
boston = load_boston()
y = boston.target
X = boston.data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.33, random_state=42)
DT_regressor = DecisionTreeRegressor()
DT_model = DT_regressor.fit(X_train, y_train)
y_DT_pred = DT_model.predict(X_test)
def mse(actual, preds):
    delta = np.sum((actual-preds)*(actual-preds))
    return delta/len(preds)
# Check your solution matches sklearn
print('decision trees')
print(mse(y_test, y_DT_pred))
print(mean_squared_error(y_test, y_DT_pred))
print("If the above match, you are all set!")
print('predicted')
print(y_DT_pred)
print('labels')
print(y_test)
Code #1
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
import numpy as np
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
boston = load_boston()
y = boston.target
X = boston.data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.33, random_state=42)
tree_mod = DecisionTreeRegressor()
tree_mod.fit(X_train, y_train)
preds_tree = tree_mod.predict(X_test)
def mse(actual, preds):
    return np.sum((actual-preds)**2)/len(actual)
# Check your solution matches sklearn
print(mse(y_test, preds_tree))
print(mean_squared_error(y_test, preds_tree))
print("If the above match, you are all set!")
print('predicted')
print(preds_tree)
print('labels')
print(y_test)
Even after changing random_state=0, there are differences.
output of code#0
26.12281437125748
26.12281437125748
If the above match, you are all set!
predicted
[23.4 24.5 20.1 11.7 20.7 20.4 21.8 20.5 22.7 16.1 10.8 17.9 14.9 8.8
50. 37. 21.2 32.7 28. 18.9 23.1 22.7 23.1 24.8 19.7 10.9 19.3 13.1
37.6 18.4 12.5 17.7 24.5 23.1 23.2 17.7 8.3 19.5 12.7 17.9 22.9 19.7
23.9 12.5 22. 20.5 22.4 13.8 15.6 28.7 13.8 18.3 18.2 35.2 19. 22.4
21.7 20.7 10.9 19.5 20.6 23.1 34.9 30.1 17.7 32. 16.1 18.9 16.7 21.7
20.6 23.8 23.2 33.1 28.4 8.8 41.7 23.1 22. 21.8 27.1 19.3 20.2 37.6
37.6 25. 19.3 13.8 24.3 14.3 17.5 11.8 23.1 35.1 21.6 23.8 10.2 20.7
14.3 23.1 25. 20.1 33.8 24.5 25. 23.1 8.3 19.5 23.8 22. 23.6 17.9
18.9 18.3 20. 20. 9.5 14.5 9.5 50. 32. 6.3 14.4 21.7 25. 17.3
34.9 22.5 18.9 36.1 12.5 9.5 15.2 19.6 10.5 34.9 20. 15.6 28.6 8.3
10.9 21.8 23.6 24.4 24.2 14.5 37.3 37.3 12.8 6.3 28.4 25. 15.6 32.4
17.4 23.7 17.3 19.7 21.8 13.1 8.3 17.5 34.9 31.6 31. 23.1 23.1]
labels
[22.6 50. 23. 8.3 21.2 19.9 20.6 18.7 16.1 18.6 8.8 17.2 14.9 10.5
50. 29. 23. 33.3 29.4 21. 23.8 19.1 20.4 29.1 19.3 23.1 19.6 19.4
38.7 18.7 14.6 20. 20.5 20.1 23.6 16.8 5.6 50. 14.5 13.3 23.9 20.
19.8 13.8 16.5 21.6 20.3 17. 11.8 27.5 15.6 23.1 24.3 42.8 15.6 21.7
17.1 17.2 15. 21.7 18.6 21. 33.1 31.5 20.1 29.8 15.2 15. 27.5 22.6
20. 21.4 23.5 31.2 23.7 7.4 48.3 24.4 22.6 18.3 23.3 17.1 27.9 44.8
50. 23. 21.4 10.2 23.3 23.2 18.9 13.4 21.9 24.8 11.9 24.3 13.8 24.7
14.1 18.7 28.1 19.8 26.7 21.7 22. 22.9 10.4 21.9 20.6 26.4 41.3 17.2
27.1 20.4 16.5 24.4 8.4 23. 9.7 50. 30.5 12.3 19.4 21.2 20.3 18.8
33.4 18.5 19.6 33.2 13.1 7.5 13.6 17.4 8.4 35.4 24. 13.4 26.2 7.2
13.1 24.5 37.2 25. 24.1 16.6 32.9 36.2 11. 7.2 22.8 28.7 14.4 24.4
18.1 22.5 20.5 15.2 17.4 13.6 8.7 18.2 35.4 31.7 33. 22.2 20.4]
output of code#1
28.135568862275445
28.135568862275445
If the above match, you are all set!
predicted
[23.1 24.5 20.1 19.1 20.7 20.4 21.8 19. 21.8 16.1 10.8 17.9 14.9 8.8
50. 37. 21.2 32.7 24.5 18.9 23.1 21.5 20.1 24.8 19.7 10.9 19.3 15.6
37.6 18.8 12.5 19.1 24.5 23.1 23.9 17.7 7. 19.5 12.7 17.9 22.9 19.7
23.9 12.5 22. 20.5 22.5 13.3 15.6 28.4 13.3 18.4 18.2 21.9 18.4 22.4
21.7 20.7 10.9 19.3 19.4 23.1 35.1 30.1 19.1 32. 16.1 18.9 16.7 21.7
20.6 23.8 23.7 33.1 28.6 7.2 41.7 23.1 22. 21.7 27.1 19.2 20.2 37.6
37.6 25. 19.3 13.8 24.3 14.3 17.5 11.8 23.2 34.9 21.6 23.8 10.9 22.3
14.3 23.1 25. 20.1 30.3 24.5 21. 23.1 8.3 19.9 23.8 22. 23.6 17.9
20. 18.4 18.9 20.7 9.5 14.5 10.2 50. 32. 6.3 14.4 21.7 25. 17.4
34.9 22.5 18.9 37.3 12.7 9.5 15.2 19.6 10.8 34.9 22.2 15.6 28.6 7.
10.9 21.7 23.6 24.4 24.2 16. 37.3 37.3 12.8 8.8 28.6 25.3 14.3 32.5
17.4 23.7 17.4 19.9 21.7 12.7 7. 17.6 35.1 31.5 30.3 23.1 22.1]
labels
[22.6 50. 23. 8.3 21.2 19.9 20.6 18.7 16.1 18.6 8.8 17.2 14.9 10.5
50. 29. 23. 33.3 29.4 21. 23.8 19.1 20.4 29.1 19.3 23.1 19.6 19.4
38.7 18.7 14.6 20. 20.5 20.1 23.6 16.8 5.6 50. 14.5 13.3 23.9 20.
19.8 13.8 16.5 21.6 20.3 17. 11.8 27.5 15.6 23.1 24.3 42.8 15.6 21.7
17.1 17.2 15. 21.7 18.6 21. 33.1 31.5 20.1 29.8 15.2 15. 27.5 22.6
20. 21.4 23.5 31.2 23.7 7.4 48.3 24.4 22.6 18.3 23.3 17.1 27.9 44.8
50. 23. 21.4 10.2 23.3 23.2 18.9 13.4 21.9 24.8 11.9 24.3 13.8 24.7
14.1 18.7 28.1 19.8 26.7 21.7 22. 22.9 10.4 21.9 20.6 26.4 41.3 17.2
27.1 20.4 16.5 24.4 8.4 23. 9.7 50. 30.5 12.3 19.4 21.2 20.3 18.8
33.4 18.5 19.6 33.2 13.1 7.5 13.6 17.4 8.4 35.4 24. 13.4 26.2 7.2
13.1 24.5 37.2 25. 24.1 16.6 32.9 36.2 11. 7.2 22.8 28.7 14.4 24.4
18.1 22.5 20.5 15.2 17.4 13.6 8.7 18.2 35.4 31.7 33. 22.2 20.4]
The model itself also has a random component, so fixing just the split is not enough. Try setting
DecisionTreeRegressor(random_state=0)
as well.
If that doesn't help, it would be useful to post your results.
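A minimal sketch of that suggestion, assuming synthetic data from make_regression as a stand-in for the Boston housing set used in the question (the helper name fit_and_predict is made up for illustration). It fits two trees with the same seed and checks that their predictions agree:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption, not the data from the question).
X, y = make_regression(n_samples=200, n_features=13, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

def fit_and_predict(seed):
    # Seeding the estimator fixes the random choices made while growing the tree.
    model = DecisionTreeRegressor(random_state=seed)
    model.fit(X_train, y_train)
    return model.predict(X_test)

preds_a = fit_and_predict(0)
preds_b = fit_and_predict(0)
# Compare the two runs element by element.
print(np.array_equal(preds_a, preds_b))
```

With the seed left at its default of None, repeated fits are free to break ties between equally good splits differently, which is consistent with the small differences seen in the two outputs above.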
I'm trying to create a BMI table with a column for height from 58 inches to 76 inches in 2-inch increments and a row for weight from 100 pounds to 250 pounds in 10-pound increments. I've got the row and the column, but I can't figure out how to calculate the different BMIs within the table.
This is my code:
header = '\t{}'.format('\t'.join(map(str, range(100, 260, 10))))
rows = []
for i in range(58, 78, 2):
    row = '\t'.join(map(str, (bmi for q in range(1, 17))))
    rows.append('{}\t{}'.format(i, row))
print(header + '\n' + '\n'.join(rows))
This is the output:
100 110 120 130 140 150 160 170 180 190 200 210 220 230 240 250
58
60
62
64
66
68
70
72
74
76
What I'm trying to do is fill in the chart. For example, a height of 58 inches and 100 pounds is a BMI of 22.4. A height of 58 inches and 110 pounds is 24.7, and so on.
I'm not sure how you got your expected results of 22.4 and 24.7, but if you define BMI to be weight [lb] / (height [in])^2 * 703, you could do something like the following:
In [16]: weights = range(100, 260, 10)
    ...: header = '\t' + '\t'.join(map(str, weights))
    ...: rows = [header]
    ...: for height in range(58, 78, 2):
    ...:     row = '\t'.join(f'{weight/height**2*703:.1f}' for weight in weights)
    ...:     rows.append(f'{height}\t{row}')
    ...: print('\n'.join(rows))
    ...:
100 110 120 130 140 150 160 170 180 190 200 210 220 230 240 250
58 20.9 23.0 25.1 27.2 29.3 31.3 33.4 35.5 37.6 39.7 41.8 43.9 46.0 48.1 50.2 52.2
60 19.5 21.5 23.4 25.4 27.3 29.3 31.2 33.2 35.1 37.1 39.1 41.0 43.0 44.9 46.9 48.8
62 18.3 20.1 21.9 23.8 25.6 27.4 29.3 31.1 32.9 34.7 36.6 38.4 40.2 42.1 43.9 45.7
64 17.2 18.9 20.6 22.3 24.0 25.7 27.5 29.2 30.9 32.6 34.3 36.0 37.8 39.5 41.2 42.9
66 16.1 17.8 19.4 21.0 22.6 24.2 25.8 27.4 29.0 30.7 32.3 33.9 35.5 37.1 38.7 40.3
68 15.2 16.7 18.2 19.8 21.3 22.8 24.3 25.8 27.4 28.9 30.4 31.9 33.4 35.0 36.5 38.0
70 14.3 15.8 17.2 18.7 20.1 21.5 23.0 24.4 25.8 27.3 28.7 30.1 31.6 33.0 34.4 35.9
72 13.6 14.9 16.3 17.6 19.0 20.3 21.7 23.1 24.4 25.8 27.1 28.5 29.8 31.2 32.5 33.9
74 12.8 14.1 15.4 16.7 18.0 19.3 20.5 21.8 23.1 24.4 25.7 27.0 28.2 29.5 30.8 32.1
76 12.2 13.4 14.6 15.8 17.0 18.3 19.5 20.7 21.9 23.1 24.3 25.6 26.8 28.0 29.2 30.4
What's probably keeping you down in your own code is the for q in range(1, 17) which you'll want to turn into your weights instead; you could just replace it with for q in range(100, 260, 10) and use the formula directly if you liked, but here we just avoid the duplication by introducing weights.
First of all, you should unindent the print statement at the end. Running this code with the print indented prints out one table each time a row is added. Secondly, the snippet of code you will want to change is
(bmi for q in range(1, 17))
Since BMI is a function of mass and height, I would rename your iterator i to height and q to mass, and change range(1, 17) to range(100, 260, 10). This improves readability. Then replace bmi with an expression using mass and height that returns the BMI. For example,
(mass*height for mass in range(100, 260, 10))
I don't believe BMI = mass*height, so replace this with the real formula.
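As a sketch of what the placeholder expression could become, assuming the common imperial definition BMI = 703 * weight [lb] / (height [in])^2 (the helper name bmi and its parameter names are made up for illustration):

```python
def bmi(weight_lb, height_in):
    # Imperial BMI formula (assumed definition): 703 * lb / in^2
    return 703 * weight_lb / height_in ** 2

weights = range(100, 260, 10)
# Header row: an empty corner cell followed by the weights.
rows = ['\t' + '\t'.join(map(str, weights))]
for height in range(58, 78, 2):
    cells = '\t'.join('{:.1f}'.format(bmi(mass, height)) for mass in weights)
    rows.append('{}\t{}'.format(height, cells))
table = '\n'.join(rows)
print(table)
```

Keeping the formula in its own function means only one line has to change if a different definition of BMI is wanted.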
I am trying to create a time-series from historical data stored in Excel sheets on a website. The website has the Excel spreadsheets organized by year (i.e., Financial futures positions for 2009, 2010, 2011,....).
Is there a way to pull all the relevant files at once to be used in a DataFrame?
I am pretty new to Python and my first thought was to download each file manually as an Excel doc and then read them to a DF with python. Was trying to find a more elegant solution to this process.
Website URL: https://www.cftc.gov/MarketReports/CommitmentsofTraders/HistoricalCompressed/index.htm
The page has several groups of files. I'm trying to find a way to select specific files/groups of files. I'm currently Googling around for solutions involving breaking down the website HTML using Beautiful Soup or something like that.
There's probably a more elegant way to find the <p> tag with the associated table of zip files/links you want, but this seemed to get the job done.
You might also want to double-check that all the data is there. For some files it threw a warning, "WARNING *** OLE2 inconsistency: SSCS size is 0 but SSAT size is non-zero", but the data still appears to be present.
Code:
import pandas as pd
import requests
from bs4 import BeautifulSoup
from zipfile import ZipFile
from io import BytesIO
url = 'https://www.cftc.gov/MarketReports/CommitmentsofTraders/HistoricalCompressed/index.htm'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
# Find the p tag with the specific table you wanted
p_tag = soup.find_all('p')
for p in p_tag:
    if 'The complete Commitments of Traders Futures Only reports' in p.text:
        break
# Get that table with the zip links
table = p.find_next_sibling('table')
a_tags = table.find_all('a', text='Excel')
# Create list of those links
files_list = []
for a in a_tags:
    href = 'https://www.cftc.gov' + a['href']
    files_list.append(href)
# Iterate through those links, get the table within the zip, and collect the frames
frames = []
for file_name in files_list[:-1]:
    year = file_name.split('_')[-1].split('.')[0]
    content = requests.get(file_name)
    zf = ZipFile(BytesIO(content.content))
    excel_file = zf.namelist()[0]
    frames.append(pd.read_excel(zf.open(excel_file)))
    print('Received: %s' % year)
# DataFrame.append was removed in pandas 2.0, so concatenate once at the end
results = pd.concat(frames, sort=True).reset_index(drop=True)
Output:
print (results.head(5).to_string())
As_of_Date_In_Form_YYMMDD CFTC_Commodity_Code CFTC_Contract_Market_Code CFTC_Market_Code CFTC_Region_Code Change_in_Comm_Long_All Change_in_Comm_Short_All Change_in_NonComm_Long_All Change_in_NonComm_Short_All Change_in_NonComm_Spead_All Change_in_NonRept_Long_All Change_in_NonRept_Short_All Change_in_Open_Interest_All Change_in_Tot_Rept_Long_All Change_in_Tot_Rept_Short_All Comm_Positions_Long_All Comm_Positions_Long_Old Comm_Positions_Long_Other Comm_Positions_Short_All Comm_Positions_Short_Old Comm_Positions_Short_Other Conc_Gross_LE_4_TDR_Long_All Conc_Gross_LE_4_TDR_Long_Old Conc_Gross_LE_4_TDR_Long_Other Conc_Gross_LE_4_TDR_Short_All Conc_Gross_LE_4_TDR_Short_Old Conc_Gross_LE_4_TDR_Short_Other Conc_Gross_LE_8_TDR_Long_All Conc_Gross_LE_8_TDR_Long_Old Conc_Gross_LE_8_TDR_Long_Other Conc_Gross_LE_8_TDR_Short_All Conc_Gross_LE_8_TDR_Short_Old Conc_Gross_LE_8_TDR_Short_Other Conc_Net_LE_4_TDR_Long_All Conc_Net_LE_4_TDR_Long_Old Conc_Net_LE_4_TDR_Long_Other Conc_Net_LE_4_TDR_Short_All Conc_Net_LE_4_TDR_Short_Old Conc_Net_LE_4_TDR_Short_Other Conc_Net_LE_8_TDR_Long_All Conc_Net_LE_8_TDR_Long_Old Conc_Net_LE_8_TDR_Long_Other Conc_Net_LE_8_TDR_Short_All Conc_Net_LE_8_TDR_Short_Old Conc_Net_LE_8_TDR_Short_Other Contract_Units Market_and_Exchange_Names NonComm_Positions_Long_All NonComm_Positions_Long_Old NonComm_Positions_Long_Other NonComm_Positions_Short_All NonComm_Positions_Short_Old NonComm_Positions_Short_Other NonComm_Positions_Spread_Old NonComm_Positions_Spread_Other NonComm_Postions_Spread_All NonRept_Positions_Long_All NonRept_Positions_Long_Old NonRept_Positions_Long_Other NonRept_Positions_Short_All NonRept_Positions_Short_Old NonRept_Positions_Short_Other Open_Interest_All Open_Interest_Old Open_Interest_Other Pct_of_OI_Comm_Long_All Pct_of_OI_Comm_Long_Old Pct_of_OI_Comm_Long_Other Pct_of_OI_Comm_Short_All Pct_of_OI_Comm_Short_Old Pct_of_OI_Comm_Short_Other Pct_of_OI_NonComm_Long_All Pct_of_OI_NonComm_Long_Old Pct_of_OI_NonComm_Long_Other 
Pct_of_OI_NonComm_Short_All Pct_of_OI_NonComm_Short_Old Pct_of_OI_NonComm_Short_Other Pct_of_OI_NonComm_Spread_All Pct_of_OI_NonComm_Spread_Old Pct_of_OI_NonComm_Spread_Other Pct_of_OI_NonRept_Long_All Pct_of_OI_NonRept_Long_Old Pct_of_OI_NonRept_Long_Other Pct_of_OI_NonRept_Short_All Pct_of_OI_NonRept_Short_Old Pct_of_OI_NonRept_Short_Other Pct_of_OI_Tot_Rept_Long_All Pct_of_OI_Tot_Rept_Long_Old Pct_of_OI_Tot_Rept_Long_Other Pct_of_OI_Tot_Rept_Short_All Pct_of_OI_Tot_Rept_Short_Old Pct_of_OI_Tot_Rept_Short_Other Pct_of_Open_Interest_All Pct_of_Open_Interest_Old Pct_of_Open_Interest_Other Report_Date_as_MM_DD_YYYY Tot_Rept_Positions_Long_All Tot_Rept_Positions_Long_Old Tot_Rept_Positions_Long_Other Tot_Rept_Positions_Short_All Tot_Rept_Positions_Short_Old Tot_Rept_Positions_Short_Other Traders_Comm_Long_All Traders_Comm_Long_Old Traders_Comm_Long_Other Traders_Comm_Short_All Traders_Comm_Short_Old Traders_Comm_Short_Other Traders_NonComm_Long_All Traders_NonComm_Long_Old Traders_NonComm_Long_Other Traders_NonComm_Short_All Traders_NonComm_Short_Old Traders_NonComm_Short_Other Traders_NonComm_Spead_Old Traders_NonComm_Spread_All Traders_NonComm_Spread_Other Traders_Tot_All Traders_Tot_Old Traders_Tot_Other Traders_Tot_Rept_Long_All Traders_Tot_Rept_Long_Old Traders_Tot_Rept_Long_Other Traders_Tot_Rept_Short_All Traders_Tot_Rept_Short_Old Traders_Tot_Rept_Short_Other
0 190910 1 001602 CBT 0 -4068.0 4892.0 6487.0 -2906.0 8280.0 -944.0 -511.0 9755.0 10699.0 10266.0 107772 92050 15722 102893 86995 15898 11.9 12.6 31.6 11.6 12.6 26.7 21.1 22.3 46.5 20.0 21.9 39.6 10.8 11.1 31.5 10.0 10.4 23.4 17.4 18.6 45.4 15.5 16.7 36.1 (CONTRACTS OF 5,000 BUSHELS) WHEAT-SRW - CHICAGO BOARD OF TRADE 121056 112747 22502 111727 105822 20098 88722 6159 109074 25816 20362 5454 40024 32342 7682 363718 313881 49837 29.6 29.3 31.5 28.3 27.7 31.9 33.3 35.9 45.2 30.7 33.7 40.3 30.0 28.3 12.4 7.1 6.5 10.9 11.0 10.3 15.4 92.9 93.5 89.1 89.0 89.7 84.6 100 100 100 2019-09-10 337902 293519 44383 323694 281539 42155 80 72 34 92 87 49 101 104 35 115 109 41 103 118 20 346 338 145 252 232 83 262 247 99
1 190903 1 001602 CBT 0 -703.0 -15572.0 -3482.0 13336.0 -702.0 337.0 -1612.0 -4550.0 -4887.0 -2938.0 111840 97821 14019 98001 82374 15627 13.2 13.7 32.6 11.9 13.0 25.9 21.9 22.9 45.3 20.1 22.2 37.7 12.1 12.2 32.6 9.9 10.4 22.2 18.5 19.0 44.3 16.1 17.7 33.7 (CONTRACTS OF 5,000 BUSHELS) WHEAT-SRW - CHICAGO BOARD OF TRADE 114569 103964 22404 114633 108323 18109 83004 5991 100794 26760 21199 5561 40535 32287 8248 353963 305988 47975 31.6 32.0 29.2 27.7 26.9 32.6 32.4 34.0 46.7 32.4 35.4 37.7 28.5 27.1 12.5 7.6 6.9 11.6 11.5 10.6 17.2 92.4 93.1 88.4 88.5 89.4 82.8 100 100 100 2019-09-03 327203 284789 42414 313428 273701 39727 81 74 35 88 84 50 90 94 36 128 123 39 95 110 20 345 338 143 243 222 83 261 250 98
2 190827 1 001602 CBT 0 -18756.0 -10204.0 5094.0 -3903.0 -13782.0 -1379.0 -934.0 -28823.0 -27444.0 -27889.0 112543 101309 11234 113573 98886 14687 12.8 13.1 32.9 12.5 14.3 25.9 20.6 21.8 47.3 21.1 23.2 37.5 11.4 11.6 32.1 9.9 11.2 22.1 17.8 18.7 45.1 16.5 18.4 31.3 (CONTRACTS OF 5,000 BUSHELS) WHEAT-SRW - CHICAGO BOARD OF TRADE 118051 108685 22347 101297 97801 16477 81990 6525 101496 26423 20736 5687 42147 34043 8104 358513 312720 45793 31.4 32.4 24.5 31.7 31.6 32.1 32.9 34.8 48.8 28.3 31.3 36.0 28.3 26.2 14.2 7.4 6.6 12.4 11.8 10.9 17.7 92.6 93.4 87.6 88.2 89.1 82.3 100 100 100 2019-08-27 332090 291984 40106 316366 278677 37689 85 81 30 96 94 51 99 104 35 110 106 38 103 116 20 341 336 139 252 238 76 264 255 99
3 190820 1 001602 CBT 0 8679.0 -1358.0 -5449.0 3109.0 -361.0 -1090.0 389.0 1779.0 2869.0 1390.0 131299 119922 11377 123777 107310 16467 12.2 12.5 30.0 11.0 12.1 26.1 19.9 20.6 46.2 19.7 21.2 37.7 10.0 9.8 29.7 8.4 9.4 22.3 15.8 16.4 43.7 13.9 15.5 32.0 (CONTRACTS OF 5,000 BUSHELS) WHEAT-SRW - CHICAGO BOARD OF TRADE 112957 104967 21051 105200 104347 13914 96015 6202 115278 27802 22317 5485 43081 35549 7532 387336 343221 44115 33.9 34.9 25.8 32.0 31.3 37.3 29.2 30.6 47.7 27.2 30.4 31.5 29.8 28.0 14.1 7.2 6.5 12.4 11.1 10.4 17.1 92.8 93.5 87.6 88.9 89.6 82.9 100 100 100 2019-08-20 359534 320904 38630 344255 307672 36583 95 93 31 98 98 55 98 102 35 113 113 40 118 127 19 350 348 144 271 261 78 273 270 102
4 190813 1 001602 CBT 0 -13926.0 -18764.0 -2482.0 2663.0 4055.0 -1910.0 -2217.0 -14263.0 -12353.0 -12046.0 122620 112079 10541 125135 107483 17652 11.2 11.2 30.6 11.7 12.5 25.1 19.1 19.6 44.8 19.5 21.0 38.1 10.4 10.3 30.0 8.5 10.1 22.9 16.0 16.7 42.2 14.4 16.2 33.3 (CONTRACTS OF 5,000 BUSHELS) WHEAT-SRW - CHICAGO BOARD OF TRADE 118406 110864 21133 102091 103554 12128 96048 6000 115639 28892 23513 5379 42692 35419 7273 385557 342504 43053 31.8 32.7 24.5 32.5 31.4 41.0 30.7 32.4 49.1 26.5 30.2 28.2 30.0 28.0 13.9 7.5 6.9 12.5 11.1 10.3 16.9 92.5 93.1 87.5 88.9 89.7 83.1 100 100 100 2019-08-13 356665 318991 37674 342865 307085 35780 85 83 31 108 106 56 102 107 33 111 108 35 119 127 19 355 352 139 261 252 75 285 279 99
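The zip handling inside the download loop can be tried without any network access. The sketch below builds a tiny archive in memory to stand in for a downloaded file (the member name and its contents are made up) and then uses the same pattern as the loop: wrap the bytes in BytesIO, take the first name from namelist(), and open that member.

```python
import io
import zipfile

# Build a small zip archive in memory (hypothetical member name and contents).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as zf:
    zf.writestr('annual_2019.csv', 'a,b\n1,2\n')
payload = buffer.getvalue()  # plays the role of the downloaded bytes

# Same access pattern as in the loop: wrap the bytes, pick the first member, open it.
with zipfile.ZipFile(io.BytesIO(payload)) as zf:
    first_member = zf.namelist()[0]
    with zf.open(first_member) as member:
        text = member.read().decode('utf-8')

print(first_member)
print(text.splitlines()[0])
```

Taking namelist()[0] assumes each archive holds exactly one file of interest; if an archive contained several members, it would be safer to pick the member by name or extension.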
Continuing from my previous question (link; the details are explained there), I have now obtained an array. I don't yet know how to use it, but that is a separate question. The point here is that the 63 x 2 selection I created contains NaN values, and I want the rows with NaN values deleted so that I can use the data (once I ask another question about how to graph it and export it as x, y arrays).
Here's what I have. This code works.
import pandas as pd
df = pd.read_csv("~/Truncated raw data hcl.csv")
data1 = [df.iloc[:, [0, 1]]]
The sample of the .csv file is located in the link.
I tried inputting
data1.dropna()
but it didn't work.
I want the NaN values/rows to drop so that I'm left with a 28 x 2 array. (I am using the first column with actual values as an example).
Thank you.
Try
import pandas as pd
df = pd.read_csv("~/Truncated raw data hcl.csv")
data1 = df.iloc[:, [0, 1]]
cleaned_data = data1.dropna()
You were probably getting an exception like "AttributeError: 'list' object has no attribute 'dropna'". That's because your data1 was not a pandas DataFrame but a list, and inside that list was a DataFrame.
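A small sketch of the difference, using a made-up frame in place of the CSV (the column names and values are hypothetical):

```python
import numpy as np
import pandas as pd

# Made-up frame standing in for the CSV file.
df = pd.DataFrame({
    'time': [0.0, 0.5, 1.0, 1.5],
    'trial 1': [23.2, 23.2, np.nan, np.nan],
    'trial 2': [23.1, 23.1, 23.1, np.nan],
})

wrapped = [df.iloc[:, [0, 1]]]   # a list holding one DataFrame
selected = df.iloc[:, [0, 1]]    # the DataFrame itself

# A plain list has no dropna method, which is why the original call failed.
print(hasattr(wrapped, 'dropna'))
# dropna() removes every row that contains at least one NaN.
cleaned = selected.dropna()
print(cleaned.shape)
```

Because dropna() returns a new DataFrame by default, the result has to be assigned to a name, as with cleaned here.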
The answer has already been given, but I would like to add a few thoughts.
Importing your DataFrame, using the example dataset from your earlier post:
>>> import pandas as pd
>>> df = pd.read_csv("so.csv")
>>> df
time 1mnaoh trial 1 1mnaoh trial 2 1mnaoh trial 3 ... 5mnaoh trial 1 5mnaoh trial 2 5mnaoh trial 3 5mnaoh trial 4
0 0.0 23.2 23.1 23.1 ... 23.3 24.3 24.1 24.1
1 0.5 23.2 23.1 23.1 ... 23.4 24.3 24.1 24.1
2 1.0 23.2 23.1 23.1 ... 23.5 24.3 24.1 24.1
3 1.5 23.2 23.1 23.1 ... 23.6 24.3 24.1 24.1
4 2.0 23.3 23.2 23.2 ... 23.7 24.5 24.7 25.1
5 2.5 24.0 23.5 23.5 ... 23.8 27.2 26.7 28.1
6 3.0 25.4 24.4 24.1 ... 23.9 31.4 29.8 31.3
7 3.5 26.9 25.5 25.1 ... 23.9 35.1 33.2 34.4
8 4.0 27.8 26.5 26.2 ... 24.0 37.7 35.9 36.8
9 4.5 28.5 27.3 27.0 ... 24.0 39.7 38.0 38.7
10 5.0 28.9 27.9 27.7 ... 24.0 40.9 39.6 40.2
11 5.5 29.2 28.2 28.3 ... 24.0 41.9 40.7 41.0
12 6.0 29.4 28.5 28.6 ... 24.1 42.5 41.6 41.2
13 6.5 29.5 28.8 28.9 ... 24.1 43.1 42.3 41.7
14 7.0 29.6 29.0 29.1 ... 24.1 43.4 42.8 42.3
15 7.5 29.7 29.2 29.2 ... 24.0 43.7 43.1 42.9
16 8.0 29.8 29.3 29.3 ... 24.2 43.8 43.3 43.3
17 8.5 29.8 29.4 29.4 ... 27.0 43.9 43.5 43.6
18 9.0 29.9 29.5 29.5 ... 30.8 44.0 43.6 43.8
19 9.5 29.9 29.6 29.5 ... 33.9 44.0 43.7 44.0
20 10.0 30.0 29.7 29.6 ... 36.2 44.0 43.7 44.1
21 10.5 30.0 29.7 29.6 ... 37.9 44.0 43.8 44.2
22 11.0 30.0 29.7 29.6 ... 39.3 NaN 43.8 44.3
23 11.5 30.0 29.8 29.7 ... 40.2 NaN 43.8 44.3
24 12.0 30.0 29.8 29.7 ... 40.9 NaN 43.9 44.3
25 12.5 30.1 29.8 29.7 ... 41.4 NaN 43.9 44.3
26 13.0 30.1 29.8 29.8 ... 41.8 NaN 43.9 44.4
27 13.5 30.1 29.9 29.8 ... 42.0 NaN 43.9 44.4
28 14.0 30.1 29.9 29.8 ... 42.1 NaN NaN 44.4
29 14.5 NaN 29.9 29.8 ... 42.3 NaN NaN 44.4
30 15.0 NaN 29.9 NaN ... 42.4 NaN NaN NaN
31 15.5 NaN NaN NaN ... 42.4 NaN NaN NaN
However, it is good to clean the data up front and only then process it as desired, so dropping the NaN values right at import is useful.
>>> df = pd.read_csv("so.csv").dropna()  # dropping the NaN rows right here
>>> df
time 1mnaoh trial 1 1mnaoh trial 2 1mnaoh trial 3 ... 5mnaoh trial 1 5mnaoh trial 2 5mnaoh trial 3 5mnaoh trial 4
0 0.0 23.2 23.1 23.1 ... 23.3 24.3 24.1 24.1
1 0.5 23.2 23.1 23.1 ... 23.4 24.3 24.1 24.1
2 1.0 23.2 23.1 23.1 ... 23.5 24.3 24.1 24.1
3 1.5 23.2 23.1 23.1 ... 23.6 24.3 24.1 24.1
4 2.0 23.3 23.2 23.2 ... 23.7 24.5 24.7 25.1
5 2.5 24.0 23.5 23.5 ... 23.8 27.2 26.7 28.1
6 3.0 25.4 24.4 24.1 ... 23.9 31.4 29.8 31.3
7 3.5 26.9 25.5 25.1 ... 23.9 35.1 33.2 34.4
8 4.0 27.8 26.5 26.2 ... 24.0 37.7 35.9 36.8
9 4.5 28.5 27.3 27.0 ... 24.0 39.7 38.0 38.7
10 5.0 28.9 27.9 27.7 ... 24.0 40.9 39.6 40.2
11 5.5 29.2 28.2 28.3 ... 24.0 41.9 40.7 41.0
12 6.0 29.4 28.5 28.6 ... 24.1 42.5 41.6 41.2
13 6.5 29.5 28.8 28.9 ... 24.1 43.1 42.3 41.7
14 7.0 29.6 29.0 29.1 ... 24.1 43.4 42.8 42.3
15 7.5 29.7 29.2 29.2 ... 24.0 43.7 43.1 42.9
16 8.0 29.8 29.3 29.3 ... 24.2 43.8 43.3 43.3
17 8.5 29.8 29.4 29.4 ... 27.0 43.9 43.5 43.6
18 9.0 29.9 29.5 29.5 ... 30.8 44.0 43.6 43.8
19 9.5 29.9 29.6 29.5 ... 33.9 44.0 43.7 44.0
20 10.0 30.0 29.7 29.6 ... 36.2 44.0 43.7 44.1
21 10.5 30.0 29.7 29.6 ... 37.9 44.0 43.8 44.2
Lastly, select the columns of your DataFrame as you wish:
>>> df = [df.iloc[:, [0, 1]]]
# new_df = [df.iloc[:, [0, 1]]] <-- if you don't want to alter actual dataFrame
>>> df
[ time 1mnaoh trial 1
0 0.0 23.2
1 0.5 23.2
2 1.0 23.2
3 1.5 23.2
4 2.0 23.3
5 2.5 24.0
6 3.0 25.4
7 3.5 26.9
8 4.0 27.8
9 4.5 28.5
10 5.0 28.9
11 5.5 29.2
12 6.0 29.4
13 6.5 29.5
14 7.0 29.6
15 7.5 29.7
16 8.0 29.8
17 8.5 29.8
18 9.0 29.9
19 9.5 29.9
20 10.0 30.0
21 10.5 30.0]
Better solution:
Looking at the end result, I see you are only concerned with two columns, 'time' and '1mnaoh trial 1'. Ideally you would use the usecols option, which reduces the memory footprint because only the columns you need are read, and then call dropna(), which I believe gives you what you wanted.
>>> df = pd.read_csv("so.csv", usecols=['time', '1mnaoh trial 1']).dropna()
>>> df
time 1mnaoh trial 1
0 0.0 23.2
1 0.5 23.2
2 1.0 23.2
3 1.5 23.2
4 2.0 23.3
5 2.5 24.0
6 3.0 25.4
7 3.5 26.9
8 4.0 27.8
9 4.5 28.5
10 5.0 28.9
11 5.5 29.2
12 6.0 29.4
13 6.5 29.5
14 7.0 29.6
15 7.5 29.7
16 8.0 29.8
17 8.5 29.8
18 9.0 29.9
19 9.5 29.9
20 10.0 30.0
21 10.5 30.0
22 11.0 30.0
23 11.5 30.0
24 12.0 30.0
25 12.5 30.1
26 13.0 30.1
27 13.5 30.1
28 14.0 30.1
I have this data file and I have to find the 3 largest numbers it contains
24.7 25.7 30.6 47.5 62.9 68.5 73.7 67.9 61.1 48.5 39.6 20.0
16.1 19.1 24.2 45.4 61.3 66.5 72.1 68.4 60.2 50.9 37.4 31.1
10.4 21.6 37.4 44.7 53.2 68.0 73.7 68.2 60.7 50.2 37.2 24.6
21.5 14.7 35.0 48.3 54.0 68.2 69.6 65.7 60.8 49.1 33.2 26.0
19.1 20.6 40.2 50.0 55.3 67.7 70.7 70.3 60.6 50.7 35.8 20.7
14.0 24.1 29.4 46.6 58.6 62.2 72.1 71.7 61.9 47.6 34.2 20.4
8.4 19.0 31.4 48.7 61.6 68.1 72.2 70.6 62.5 52.7 36.7 23.8
11.2 20.0 29.6 47.7 55.8 73.2 68.0 67.1 64.9 57.1 37.6 27.7
13.4 17.2 30.8 43.7 62.3 66.4 70.2 71.6 62.1 46.0 32.7 17.3
22.5 25.7 42.3 45.2 55.5 68.9 72.3 72.3 62.5 55.6 38.0 20.4
17.6 20.5 34.2 49.2 54.8 63.8 74.0 67.1 57.7 50.8 36.8 25.5
20.4 19.6 24.6 41.3 61.8 68.5 72.0 71.1 57.3 52.5 40.6 26.2
Therefore I have written the following code, but it only searches the first row of numbers instead of the entire list. Can anyone help to find the error?
def three_highest_temps(f):
    file = open(f, "r")
    largest = 0
    second_largest = 0
    third_largest = 0
    temp = []
    for line in file:
        temps = line.split()
        for i in temps:
            if i > largest:
                largest = i
            elif largest > i > second_largest:
                second_largest = i
            elif second_largest > i > third_largest:
                third_largest = i
        return largest, second_largest, third_largest
print(three_highest_temps("data5.txt"))
Your data contains floats, not integers.
You can use sorted:
>>> data = '''24.7 25.7 30.6 47.5 62.9 68.5 73.7 67.9 61.1 48.5 39.6 20.0
... 16.1 19.1 24.2 45.4 61.3 66.5 72.1 68.4 60.2 50.9 37.4 31.1
... 10.4 21.6 37.4 44.7 53.2 68.0 73.7 68.2 60.7 50.2 37.2 24.6
... 21.5 14.7 35.0 48.3 54.0 68.2 69.6 65.7 60.8 49.1 33.2 26.0
... 19.1 20.6 40.2 50.0 55.3 67.7 70.7 70.3 60.6 50.7 35.8 20.7
... 14.0 24.1 29.4 46.6 58.6 62.2 72.1 71.7 61.9 47.6 34.2 20.4
... 8.4 19.0 31.4 48.7 61.6 68.1 72.2 70.6 62.5 52.7 36.7 23.8
... 11.2 20.0 29.6 47.7 55.8 73.2 68.0 67.1 64.9 57.1 37.6 27.7
... 13.4 17.2 30.8 43.7 62.3 66.4 70.2 71.6 62.1 46.0 32.7 17.3
... 22.5 25.7 42.3 45.2 55.5 68.9 72.3 72.3 62.5 55.6 38.0 20.4
... 17.6 20.5 34.2 49.2 54.8 63.8 74.0 67.1 57.7 50.8 36.8 25.5
... 20.4 19.6 24.6 41.3 61.8 68.5 72.0 71.1 57.3 52.5 40.6 26.2
... '''
>>> sorted(map(float, data.split()), reverse=True)[:3]
[74.0, 73.7, 73.7]
If you want integer results:
>>> temps = sorted(map(float, data.split()), reverse=True)[:3]
>>> list(map(int, temps))
[74, 73, 73]
You only get the max elements for the first line because you return at the end of the first iteration. You should de-indent the return statement.
Sorting the data and picking the first 3 elements runs in O(n log n) time.
data = [float(v) for line in file for v in line.split()]
sorted(data, reverse=True)[:3]
It is perfectly fine for 144 elements.
You can also get the answer in linear time using heapq:
import heapq
heapq.nlargest(3, data)
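A self-contained sketch of the heapq variant, using three rows copied from the question instead of an open file (the list name lines is an assumption made for illustration):

```python
import heapq

# Three sample rows taken from the data in the question.
lines = [
    "24.7 25.7 30.6 47.5 62.9 68.5 73.7 67.9 61.1 48.5 39.6 20.0",
    "17.6 20.5 34.2 49.2 54.8 63.8 74.0 67.1 57.7 50.8 36.8 25.5",
    "10.4 21.6 37.4 44.7 53.2 68.0 73.7 68.2 60.7 50.2 37.2 24.6",
]
# Flatten all rows into one list of floats (outer loop first, then inner loop).
data = [float(v) for line in lines for v in line.split()]
# nlargest returns the k largest values in descending order, duplicates included.
top_three = heapq.nlargest(3, data)
print(top_three)
```

For a fixed small k such as 3, nlargest only keeps k candidates while scanning, so the cost grows roughly linearly with the number of values.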
Your return statement is inside the for loop. Once return is reached, the function terminates, so the loop never gets into a second iteration. Move the return outside the loop by reducing indentation.
for line in file:
    temps = line.split()
    for i in temps:
        if i > largest:
            largest = i
        elif largest > i > second_largest:
            second_largest = i
        elif second_largest > i > third_largest:
            third_largest = i
return largest, second_largest, third_largest
In addition, your comparisons won't work, because line.split() returns a list of strings, not floats. (As has been pointed out, your data consists of floats, not ints. I'm assuming the task is to find the largest float.) So let's convert the strings using float()
Your code still won't be correct, though, because when you find a new largest value, you completely discard the old one. Instead you should now consider it the second largest known value. Same rule applies for second to third largest.
for line in file:
    temps = line.split()
    for temp_string in temps:
        i = float(temp_string)
        if i > largest:
            third_largest = second_largest
            second_largest = largest
            largest = i
        elif largest > i > second_largest:
            third_largest = second_largest
            second_largest = i
        elif second_largest > i > third_largest:
            third_largest = i
return largest, second_largest, third_largest
Now there is one last issue:
You overlook cases where i is identical with one of the largest values. In such a case i > largest would be false, but so would largest > i. You could change either of these comparisons to >= to fix this.
Instead, let us simplify the if clauses by considering that the elif conditions are only considered after all previous conditions were already found to be false. When we reach the first elif, we already know that i can not be larger than largest, so it suffices to compare it to second largest. The same goes for the second elif.
for line in file:
    temps = line.split()
    for temp_string in temps:
        i = float(temp_string)
        if i > largest:
            third_largest = second_largest
            second_largest = largest
            largest = i
        elif i > second_largest:
            third_largest = second_largest
            second_largest = i
        elif i > third_largest:
            third_largest = i
return largest, second_largest, third_largest
This way we avoid accidentally filtering out the i == largest and i == second_largest edge cases.
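Put together as a complete function, the final version might look like the sketch below. Two things are assumptions rather than part of the answer: the function takes any iterable of lines instead of a file name, so it can be tried without a data file, and the starting values are negative infinity instead of 0, so negative temperatures would also be handled.

```python
def three_highest_temps(lines):
    # Start below any possible reading (assumption: replaces the initial 0).
    largest = second_largest = third_largest = float('-inf')
    for line in lines:
        for temp_string in line.split():
            i = float(temp_string)
            if i > largest:
                # Shift the previous leaders down one place each.
                third_largest = second_largest
                second_largest = largest
                largest = i
            elif i > second_largest:
                third_largest = second_largest
                second_largest = i
            elif i > third_largest:
                third_largest = i
    return largest, second_largest, third_largest

# Tiny made-up sample containing a duplicate value.
sample = ["24.7 25.7 73.7", "74.0 73.7 20.0"]
result = three_highest_temps(sample)
print(result)
```

An open file object can be passed in directly, since iterating over a file yields its lines.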
Since you are dealing with a file, as a fast, NumPy-based approach you can load the file as an array, sort it, and take the last 3 items:
import numpy as np
with open('filename') as f:
    array = np.genfromtxt(f).ravel()
array.sort()
print(array[-3:])
[ 73.7 73.7 74. ]