Continuing from my previous question (link; the background is explained there), I now have an array. I don't yet know how to use it, but that is a separate question. The issue here is that the 63 x 2 selection I created contains NaN values, and I want the rows with NaN values deleted so that I can use the data (once I ask another question about graphing it and exporting it as x, y arrays).
Here's what I have. This code works.
import pandas as pd
df = pd.read_csv("~/Truncated raw data hcl.csv")
data1 = [df.iloc[:, [0, 1]]]
The sample of the .csv file is located in the link.
I tried inputting
data1.dropna()
but it didn't work.
I want the NaN values/rows to drop so that I'm left with a 28 x 2 array. (I am using the first column with actual values as an example).
Thank you.
Try
import pandas as pd
df = pd.read_csv("~/Truncated raw data hcl.csv")
data1 = df.iloc[:, [0, 1]]
cleaned_data = data1.dropna()
You were probably getting an exception like "AttributeError: 'list' object has no attribute 'dropna'". That's because your data1 was not a pandas DataFrame but a list, and inside that list was a DataFrame.
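To make the difference concrete, here is a minimal sketch. The inline CSV text is a made-up stand-in for the real file (an assumption, so the example runs without it); everything else uses the same calls as above.

```python
import io
import pandas as pd

# Hypothetical stand-in for the CSV file: blank fields are read as NaN.
csv_text = "time,trial 1\n0.0,23.2\n0.5,23.2\n1.0,\n1.5,\n"
df = pd.read_csv(io.StringIO(csv_text))

as_list = [df.iloc[:, [0, 1]]]   # a list that holds one DataFrame
as_frame = df.iloc[:, [0, 1]]    # the DataFrame itself

print(type(as_list).__name__)       # list
print(hasattr(as_list, "dropna"))   # False: lists have no dropna method

cleaned = as_frame.dropna()         # drops the two rows that contain NaN
print(cleaned.shape)                # (2, 2)
```

If you really need the list wrapper later, you can still clean first and wrap afterwards, e.g. [as_frame.dropna()].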
The answer has already been given, but I would like to add a few thoughts.
Importing your DataFrame, using the example dataset from your earlier post:
>>> import pandas as pd
>>> df = pd.read_csv("so.csv")
>>> df
time 1mnaoh trial 1 1mnaoh trial 2 1mnaoh trial 3 ... 5mnaoh trial 1 5mnaoh trial 2 5mnaoh trial 3 5mnaoh trial 4
0 0.0 23.2 23.1 23.1 ... 23.3 24.3 24.1 24.1
1 0.5 23.2 23.1 23.1 ... 23.4 24.3 24.1 24.1
2 1.0 23.2 23.1 23.1 ... 23.5 24.3 24.1 24.1
3 1.5 23.2 23.1 23.1 ... 23.6 24.3 24.1 24.1
4 2.0 23.3 23.2 23.2 ... 23.7 24.5 24.7 25.1
5 2.5 24.0 23.5 23.5 ... 23.8 27.2 26.7 28.1
6 3.0 25.4 24.4 24.1 ... 23.9 31.4 29.8 31.3
7 3.5 26.9 25.5 25.1 ... 23.9 35.1 33.2 34.4
8 4.0 27.8 26.5 26.2 ... 24.0 37.7 35.9 36.8
9 4.5 28.5 27.3 27.0 ... 24.0 39.7 38.0 38.7
10 5.0 28.9 27.9 27.7 ... 24.0 40.9 39.6 40.2
11 5.5 29.2 28.2 28.3 ... 24.0 41.9 40.7 41.0
12 6.0 29.4 28.5 28.6 ... 24.1 42.5 41.6 41.2
13 6.5 29.5 28.8 28.9 ... 24.1 43.1 42.3 41.7
14 7.0 29.6 29.0 29.1 ... 24.1 43.4 42.8 42.3
15 7.5 29.7 29.2 29.2 ... 24.0 43.7 43.1 42.9
16 8.0 29.8 29.3 29.3 ... 24.2 43.8 43.3 43.3
17 8.5 29.8 29.4 29.4 ... 27.0 43.9 43.5 43.6
18 9.0 29.9 29.5 29.5 ... 30.8 44.0 43.6 43.8
19 9.5 29.9 29.6 29.5 ... 33.9 44.0 43.7 44.0
20 10.0 30.0 29.7 29.6 ... 36.2 44.0 43.7 44.1
21 10.5 30.0 29.7 29.6 ... 37.9 44.0 43.8 44.2
22 11.0 30.0 29.7 29.6 ... 39.3 NaN 43.8 44.3
23 11.5 30.0 29.8 29.7 ... 40.2 NaN 43.8 44.3
24 12.0 30.0 29.8 29.7 ... 40.9 NaN 43.9 44.3
25 12.5 30.1 29.8 29.7 ... 41.4 NaN 43.9 44.3
26 13.0 30.1 29.8 29.8 ... 41.8 NaN 43.9 44.4
27 13.5 30.1 29.9 29.8 ... 42.0 NaN 43.9 44.4
28 14.0 30.1 29.9 29.8 ... 42.1 NaN NaN 44.4
29 14.5 NaN 29.9 29.8 ... 42.3 NaN NaN 44.4
30 15.0 NaN 29.9 NaN ... 42.4 NaN NaN NaN
31 15.5 NaN NaN NaN ... 42.4 NaN NaN NaN
However, it is good to clean the data first and then process it as desired, so dropping the NA values right at import is useful. Note that dropna() without arguments drops every row that has a NaN in any column.
>>> df = pd.read_csv("so.csv").dropna()  # dropping the NA values here itself
>>> df
time 1mnaoh trial 1 1mnaoh trial 2 1mnaoh trial 3 ... 5mnaoh trial 1 5mnaoh trial 2 5mnaoh trial 3 5mnaoh trial 4
0 0.0 23.2 23.1 23.1 ... 23.3 24.3 24.1 24.1
1 0.5 23.2 23.1 23.1 ... 23.4 24.3 24.1 24.1
2 1.0 23.2 23.1 23.1 ... 23.5 24.3 24.1 24.1
3 1.5 23.2 23.1 23.1 ... 23.6 24.3 24.1 24.1
4 2.0 23.3 23.2 23.2 ... 23.7 24.5 24.7 25.1
5 2.5 24.0 23.5 23.5 ... 23.8 27.2 26.7 28.1
6 3.0 25.4 24.4 24.1 ... 23.9 31.4 29.8 31.3
7 3.5 26.9 25.5 25.1 ... 23.9 35.1 33.2 34.4
8 4.0 27.8 26.5 26.2 ... 24.0 37.7 35.9 36.8
9 4.5 28.5 27.3 27.0 ... 24.0 39.7 38.0 38.7
10 5.0 28.9 27.9 27.7 ... 24.0 40.9 39.6 40.2
11 5.5 29.2 28.2 28.3 ... 24.0 41.9 40.7 41.0
12 6.0 29.4 28.5 28.6 ... 24.1 42.5 41.6 41.2
13 6.5 29.5 28.8 28.9 ... 24.1 43.1 42.3 41.7
14 7.0 29.6 29.0 29.1 ... 24.1 43.4 42.8 42.3
15 7.5 29.7 29.2 29.2 ... 24.0 43.7 43.1 42.9
16 8.0 29.8 29.3 29.3 ... 24.2 43.8 43.3 43.3
17 8.5 29.8 29.4 29.4 ... 27.0 43.9 43.5 43.6
18 9.0 29.9 29.5 29.5 ... 30.8 44.0 43.6 43.8
19 9.5 29.9 29.6 29.5 ... 33.9 44.0 43.7 44.0
20 10.0 30.0 29.7 29.6 ... 36.2 44.0 43.7 44.1
21 10.5 30.0 29.7 29.6 ... 37.9 44.0 43.8 44.2
and lastly select the columns you need:
>>> df = [df.iloc[:, [0, 1]]]
# new_df = [df.iloc[:, [0, 1]]]  # use this instead if you don't want to overwrite the original DataFrame
>>> df
[ time 1mnaoh trial 1
0 0.0 23.2
1 0.5 23.2
2 1.0 23.2
3 1.5 23.2
4 2.0 23.3
5 2.5 24.0
6 3.0 25.4
7 3.5 26.9
8 4.0 27.8
9 4.5 28.5
10 5.0 28.9
11 5.5 29.2
12 6.0 29.4
13 6.5 29.5
14 7.0 29.6
15 7.5 29.7
16 8.0 29.8
17 8.5 29.8
18 9.0 29.9
19 9.5 29.9
20 10.0 30.0
21 10.5 30.0]
Better solution:
Looking at the end result, you only care about two columns, 'time' and '1mnaoh trial 1'. Ideally, use the usecols option of read_csv, which reduces the memory footprint because only the columns you need are loaded, and then call dropna(), which I believe gives you what you wanted.
>>> df = pd.read_csv("so.csv", usecols=['time', '1mnaoh trial 1']).dropna()
>>> df
time 1mnaoh trial 1
0 0.0 23.2
1 0.5 23.2
2 1.0 23.2
3 1.5 23.2
4 2.0 23.3
5 2.5 24.0
6 3.0 25.4
7 3.5 26.9
8 4.0 27.8
9 4.5 28.5
10 5.0 28.9
11 5.5 29.2
12 6.0 29.4
13 6.5 29.5
14 7.0 29.6
15 7.5 29.7
16 8.0 29.8
17 8.5 29.8
18 9.0 29.9
19 9.5 29.9
20 10.0 30.0
21 10.5 30.0
22 11.0 30.0
23 11.5 30.0
24 12.0 30.0
25 12.5 30.1
26 13.0 30.1
27 13.5 30.1
28 14.0 30.1
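The two outputs above differ in length (22 rows versus 29) because a plain dropna() on the full frame removes a row if any column is NaN. A minimal sketch of the difference, using a small made-up table (the column names a and b are assumptions for illustration):

```python
import io
import pandas as pd

# Hypothetical data: column "b" runs out of values earlier than column "a".
csv_text = "time,a,b\n0.0,1.0,5.0\n0.5,2.0,\n1.0,3.0,\n1.5,,\n"
df = pd.read_csv(io.StringIO(csv_text))

# Keeps only rows that are complete in every column.
all_cols = df.dropna()

# Keeps rows that are complete in the two columns of interest.
two_cols = df.dropna(subset=["time", "a"])[["time", "a"]]

print(len(all_cols))   # 1
print(len(two_cols))   # 3
```

So if the full frame is already loaded, dropna(subset=...) should give the same rows as the usecols approach without reading the file again.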
I would like to convert a program that fills a 2-dimensional array (matrix) from C++ to Python.
Here is my C++ code:
#include <iostream>
using namespace std;
int x;
int y;
double z[11][11];
int main()
{
z[0][0] = 10;
z[10][0] = 20;
z[0][10] = 20;
z[10][10] = 60;
for (y=0; y<11; y++)
{
z[0][y] = (20-10)/10.0*y+10;
z[10][y] = (60-20)/10.0*y+20;
}
for (y=0; y<11; y++)
{
for (x=1; x<11; x++)
{
z[x][y] = (z[10][y]-z[0][y])/10.0 + z[x-1][y];
}
}
for (y=0; y<11; y++)
{
for (x=0; x<11; x++)
{
cout<<(z[x][y])<<"\t";
if (x==10)
{
cout<<endl;
}
}
}
}
Here is the result:
10 11 12 13 14 15 16 17 18 19 20
11 12.3 13.6 14.9 16.2 17.5 18.8 20.1 21.4 22.7 24
12 13.6 15.2 16.8 18.4 20 21.6 23.2 24.8 26.4 28
13 14.9 16.8 18.7 20.6 22.5 24.4 26.3 28.2 30.1 32
14 16.2 18.4 20.6 22.8 25 27.2 29.4 31.6 33.8 36
15 17.5 20 22.5 25 27.5 30 32.5 35 37.5 40
16 18.8 21.6 24.4 27.2 30 32.8 35.6 38.4 41.2 44
17 20.1 23.2 26.3 29.4 32.5 35.6 38.7 41.8 44.9 48
18 21.4 24.8 28.2 31.6 35 38.4 41.8 45.2 48.6 52
19 22.7 26.4 30.1 33.8 37.5 41.2 44.9 48.6 52.3 56
20 24 28 32 36 40 44 48 52 56 60
I do not know how to write the double for loop in Python from C++:
for (y=0; y<11; y++)
{
for (x=1; x<11; x++)
{
z[x][y] = (z[10][y]-z[0][y])/10.0 + z[x-1][y];
}
}
I would like to make z[x][y] = (z[10][y]-z[0][y])/10.0 + z[x-1][y] into Python format please.
I tried:
z=[[0 for x in range(0,11)] for y in range(0,11)]
for y in range (0,11):
    x=0
    z[x][y]=((20-10)/10.0*y+10)
for y in range (0,11):
    x=10
    z[x][y]=((60-20)/10.0*y+20)
for y in range (0,11):
    for x in range (1, 11):
        z[x][y] = (((np.asarray(z[10][y])-np.asarray(z[0][y]))/10.0 + np.asarray(z[x-1][y])))
for y in range (0,11):
    for x in range (0, 11):
        print (z[x][y])
and it printed
10.0
11.0
12.0
13.0
14.0
15.0
16.0
17.0
18.0
19.0
20.0
11.0
12.3
13.600000000000001
14.900000000000002
16.200000000000003
17.500000000000004
18.800000000000004
20.100000000000005
21.400000000000006
22.700000000000006
24.000000000000007
12.0
13.6
15.2
16.8
18.400000000000002
20.000000000000004
21.600000000000005
23.200000000000006
24.800000000000008
26.40000000000001
28.00000000000001
13.0
14.9
16.8
18.7
20.599999999999998
22.499999999999996
24.399999999999995
26.299999999999994
28.199999999999992
30.09999999999999
31.99999999999999
14.0
16.2
18.4
20.599999999999998
22.799999999999997
24.999999999999996
27.199999999999996
29.399999999999995
31.599999999999994
33.8
36.0
15.0
17.5
20.0
22.5
25.0
27.5
30.0
32.5
35.0
37.5
40.0
16.0
18.8
21.6
24.400000000000002
27.200000000000003
30.000000000000004
32.800000000000004
35.6
38.4
41.199999999999996
43.99999999999999
17.0
20.1
23.200000000000003
26.300000000000004
29.400000000000006
32.50000000000001
35.60000000000001
38.70000000000001
41.80000000000001
44.90000000000001
48.000000000000014
18.0
21.4
24.799999999999997
28.199999999999996
31.599999999999994
34.99999999999999
38.39999999999999
41.79999999999999
45.19999999999999
48.59999999999999
51.999999999999986
19.0
22.7
26.4
30.099999999999998
33.8
37.5
41.2
44.900000000000006
48.60000000000001
52.30000000000001
56.000000000000014
20.0
24.0
28.0
32.0
36.0
40.0
44.0
48.0
52.0
56.0
60.0
Many thanks.
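For comparison, here is a minimal sketch of a direct translation in plain Python 3, without NumPy. The helper names build_grid and format_grid are made up for this example. The loop body is the same expression as in the C++ code; the only real difference from the attempt above is the printing, where each row is joined with tabs and the ":g" format (6 significant digits, like the default for cout) hides the floating-point noise.

```python
N = 11

def build_grid():
    """Fill the N x N grid exactly as the C++ loops do."""
    z = [[0.0] * N for _ in range(N)]
    # Boundary rows x = 0 and x = 10.
    for y in range(N):
        z[0][y] = (20 - 10) / 10.0 * y + 10
        z[N - 1][y] = (60 - 20) / 10.0 * y + 20
    # Same double loop as the C++ version.
    for y in range(N):
        for x in range(1, N):
            z[x][y] = (z[N - 1][y] - z[0][y]) / 10.0 + z[x - 1][y]
    return z

def format_grid(z):
    """One tab-separated line per y, values rounded to 6 significant digits."""
    return ["\t".join(f"{z[x][y]:g}" for x in range(N)) for y in range(N)]

lines = format_grid(build_grid())
print("\n".join(lines))
```

The stored values still carry the small rounding errors (13.600000000000001 and so on); only the display is rounded.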
I have this df:
CODIGO NOMBRE Enero Enero Febrero Febrero Marzo Marzo ....
000130 RICA PLAYA 31.3 21.0 31.7 22.0 31.8 22.0
000132 PUERTO PIZARRO 32.5 19.0 32.2 18.0 32.5 17.0
000134 PAPAYAL 31.7 25.0 31.5 27.0 31.8 26.0
000135 EL SALTO 31.1 27.0 31.5 26.0 31.5 26.0
000136 CAÑAVERAL 32.4 17.0 32.0 16.0 32.3 16.0
... ... ... ... ... ...
158317 SUSAPAYA 17.3 20.0 16.8 20.0 17.2 19.0
158321 PALCA 17.9 16.0 17.8 17.0 18.4 16.0
158323 TALABAYA 17.1 12.0 16.7 12.0 17.2 12.0
158326 CAPAZO 13.7 19.0 13.6 19.0 13.5 17.0
158328 PAUCARANI 13.1 15.0 12.9 15.0 13.4 14.0 ....
with 26 columns.
I want to rename the second Enero to N1, and the second Febrero to N2, second Marzo to N3, etc etc like this:
CODIGO NOMBRE Enero N1 Febrero N2 Marzo N3 ....
000130 RICA PLAYA 31.3 21.0 31.7 22.0 31.8 22.0
000132 PUERTO PIZARRO 32.5 19.0 32.2 18.0 32.5 17.0
000134 PAPAYAL 31.7 25.0 31.5 27.0 31.8 26.0
000135 EL SALTO 31.1 27.0 31.5 26.0 31.5 26.0
000136 CAÑAVERAL 32.4 17.0 32.0 16.0 32.3 16.0
... ... ... ... ... ...
158317 SUSAPAYA 17.3 20.0 16.8 20.0 17.2 19.0
158321 PALCA 17.9 16.0 17.8 17.0 18.4 16.0
158323 TALABAYA 17.1 12.0 16.7 12.0 17.2 12.0
158326 CAPAZO 13.7 19.0 13.6 19.0 13.5 17.0
158328 PAUCARANI 13.1 15.0 12.9 15.0 13.4 14.0 ....
So I did:
df.columns['CODIGO','NOMBRE','Enero','N1','Febrero','N2'...... etc etc]
Is there a more efficient or faster way to do this than writing every name?
Assuming duplicated values are in the correct order they can be replaced by modifying the values of columns where duplicated:
m = df.columns.duplicated()
df.columns.values[m] = [f'N{i}' for i in range(1, 1 + m.sum())]
Or with arange and Series:
import numpy as np
df.columns.values[m] = 'N' + pd.Series(np.arange(1, 1 + m.sum()), dtype=str)
Or with cumsum:
df.columns.values[m] = 'N' + pd.Series(m.cumsum()[m], dtype=str)
import pandas as pd
df = pd.DataFrame(columns=['CODIGO', 'NOMBRE', 'Enero', 'Enero',
'Febrero', 'Febrero', 'Marzo', 'Marzo'])
print('Before', df)
m = df.columns.duplicated()
df.columns.values[m] = [f'N{i}' for i in range(1, 1 + m.sum())]
print('After', df)
Before Empty DataFrame
Columns: [CODIGO, NOMBRE, Enero, Enero, Febrero, Febrero, Marzo, Marzo]
Index: []
After Empty DataFrame
Columns: [CODIGO, NOMBRE, Enero, N1, Febrero, N2, Marzo, N3]
Index: []
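An alternative sketch that builds a new list of names instead of writing into columns.values in place. The function rename_duplicates is a made-up helper, not a pandas API; it treats every repeat of an already seen name as a duplicate, which matches the default behaviour of duplicated().

```python
def rename_duplicates(names, prefix="N"):
    """Return names with each repeated entry replaced by prefix + running count."""
    seen = set()
    renamed = []
    count = 0
    for name in names:
        if name in seen:
            count += 1
            renamed.append(f"{prefix}{count}")
        else:
            seen.add(name)
            renamed.append(name)
    return renamed

columns = ['CODIGO', 'NOMBRE', 'Enero', 'Enero',
           'Febrero', 'Febrero', 'Marzo', 'Marzo']
new_columns = rename_duplicates(columns)
print(new_columns)
```

With a DataFrame, the result can be assigned back with df.columns = rename_duplicates(df.columns).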
I am trying to create a time-series from historical data stored in Excel sheets on a website. The website has the Excel spreadsheets organized by year (i.e., Financial futures positions for 2009, 2010, 2011,....).
Is there a way to pull all the relevant files at once to be used in a DataFrame?
I am pretty new to Python and my first thought was to download each file manually as an Excel doc and then read them to a DF with python. Was trying to find a more elegant solution to this process.
Website URL: https://www.cftc.gov/MarketReports/CommitmentsofTraders/HistoricalCompressed/index.htm
The page has several groups of files. I'm trying to find a way to select specific files/groups of files. I'm currently Googling around for solutions involving breaking down the website HTML using Beautiful Soup or something like that.
There's probably a more elegant way to find the <p> tag with the associated table of zip files/links you want, but this seemed to get the job done.
You might also want to double-check that everything is there. For some files it was throwing a warning: "WARNING *** OLE2 inconsistency: SSCS size is 0 but SSAT size is non-zero", but the data still appears to be there.
Code:
import pandas as pd
import requests
from bs4 import BeautifulSoup
from zipfile import ZipFile
from io import BytesIO
url = 'https://www.cftc.gov/MarketReports/CommitmentsofTraders/HistoricalCompressed/index.htm'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
# Find the p tag that precedes the specific table you wanted
p_tag = soup.find_all('p')
for p in p_tag:
    if 'The complete Commitments of Traders Futures Only reports' in p.text:
        break
# Get that table with the zip links
table = p.find_next_sibling('table')
a_tags = table.find_all('a', text = 'Excel')
# Create list of those links
files_list = []
for a in a_tags:
    href = 'https://www.cftc.gov' + a['href']
    files_list.append(href)
# Iterate through those links, get the table within the zip, and append to a results dataframe
results = pd.DataFrame()
for file_name in files_list[:-1]:
    year = file_name.split('_')[-1].split('.')[0]
    content = requests.get(file_name)
    zf = ZipFile(BytesIO(content.content))
    excel_file = zf.namelist()[0]
    temp_df = pd.read_excel(zf.open(excel_file))
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    results = pd.concat([results, temp_df], sort=True).reset_index(drop=True)
    print ('Received: %s' %year)
Output:
print (results.head(5).to_string())
As_of_Date_In_Form_YYMMDD CFTC_Commodity_Code CFTC_Contract_Market_Code CFTC_Market_Code CFTC_Region_Code Change_in_Comm_Long_All Change_in_Comm_Short_All Change_in_NonComm_Long_All Change_in_NonComm_Short_All Change_in_NonComm_Spead_All Change_in_NonRept_Long_All Change_in_NonRept_Short_All Change_in_Open_Interest_All Change_in_Tot_Rept_Long_All Change_in_Tot_Rept_Short_All Comm_Positions_Long_All Comm_Positions_Long_Old Comm_Positions_Long_Other Comm_Positions_Short_All Comm_Positions_Short_Old Comm_Positions_Short_Other Conc_Gross_LE_4_TDR_Long_All Conc_Gross_LE_4_TDR_Long_Old Conc_Gross_LE_4_TDR_Long_Other Conc_Gross_LE_4_TDR_Short_All Conc_Gross_LE_4_TDR_Short_Old Conc_Gross_LE_4_TDR_Short_Other Conc_Gross_LE_8_TDR_Long_All Conc_Gross_LE_8_TDR_Long_Old Conc_Gross_LE_8_TDR_Long_Other Conc_Gross_LE_8_TDR_Short_All Conc_Gross_LE_8_TDR_Short_Old Conc_Gross_LE_8_TDR_Short_Other Conc_Net_LE_4_TDR_Long_All Conc_Net_LE_4_TDR_Long_Old Conc_Net_LE_4_TDR_Long_Other Conc_Net_LE_4_TDR_Short_All Conc_Net_LE_4_TDR_Short_Old Conc_Net_LE_4_TDR_Short_Other Conc_Net_LE_8_TDR_Long_All Conc_Net_LE_8_TDR_Long_Old Conc_Net_LE_8_TDR_Long_Other Conc_Net_LE_8_TDR_Short_All Conc_Net_LE_8_TDR_Short_Old Conc_Net_LE_8_TDR_Short_Other Contract_Units Market_and_Exchange_Names NonComm_Positions_Long_All NonComm_Positions_Long_Old NonComm_Positions_Long_Other NonComm_Positions_Short_All NonComm_Positions_Short_Old NonComm_Positions_Short_Other NonComm_Positions_Spread_Old NonComm_Positions_Spread_Other NonComm_Postions_Spread_All NonRept_Positions_Long_All NonRept_Positions_Long_Old NonRept_Positions_Long_Other NonRept_Positions_Short_All NonRept_Positions_Short_Old NonRept_Positions_Short_Other Open_Interest_All Open_Interest_Old Open_Interest_Other Pct_of_OI_Comm_Long_All Pct_of_OI_Comm_Long_Old Pct_of_OI_Comm_Long_Other Pct_of_OI_Comm_Short_All Pct_of_OI_Comm_Short_Old Pct_of_OI_Comm_Short_Other Pct_of_OI_NonComm_Long_All Pct_of_OI_NonComm_Long_Old Pct_of_OI_NonComm_Long_Other 
Pct_of_OI_NonComm_Short_All Pct_of_OI_NonComm_Short_Old Pct_of_OI_NonComm_Short_Other Pct_of_OI_NonComm_Spread_All Pct_of_OI_NonComm_Spread_Old Pct_of_OI_NonComm_Spread_Other Pct_of_OI_NonRept_Long_All Pct_of_OI_NonRept_Long_Old Pct_of_OI_NonRept_Long_Other Pct_of_OI_NonRept_Short_All Pct_of_OI_NonRept_Short_Old Pct_of_OI_NonRept_Short_Other Pct_of_OI_Tot_Rept_Long_All Pct_of_OI_Tot_Rept_Long_Old Pct_of_OI_Tot_Rept_Long_Other Pct_of_OI_Tot_Rept_Short_All Pct_of_OI_Tot_Rept_Short_Old Pct_of_OI_Tot_Rept_Short_Other Pct_of_Open_Interest_All Pct_of_Open_Interest_Old Pct_of_Open_Interest_Other Report_Date_as_MM_DD_YYYY Tot_Rept_Positions_Long_All Tot_Rept_Positions_Long_Old Tot_Rept_Positions_Long_Other Tot_Rept_Positions_Short_All Tot_Rept_Positions_Short_Old Tot_Rept_Positions_Short_Other Traders_Comm_Long_All Traders_Comm_Long_Old Traders_Comm_Long_Other Traders_Comm_Short_All Traders_Comm_Short_Old Traders_Comm_Short_Other Traders_NonComm_Long_All Traders_NonComm_Long_Old Traders_NonComm_Long_Other Traders_NonComm_Short_All Traders_NonComm_Short_Old Traders_NonComm_Short_Other Traders_NonComm_Spead_Old Traders_NonComm_Spread_All Traders_NonComm_Spread_Other Traders_Tot_All Traders_Tot_Old Traders_Tot_Other Traders_Tot_Rept_Long_All Traders_Tot_Rept_Long_Old Traders_Tot_Rept_Long_Other Traders_Tot_Rept_Short_All Traders_Tot_Rept_Short_Old Traders_Tot_Rept_Short_Other
0 190910 1 001602 CBT 0 -4068.0 4892.0 6487.0 -2906.0 8280.0 -944.0 -511.0 9755.0 10699.0 10266.0 107772 92050 15722 102893 86995 15898 11.9 12.6 31.6 11.6 12.6 26.7 21.1 22.3 46.5 20.0 21.9 39.6 10.8 11.1 31.5 10.0 10.4 23.4 17.4 18.6 45.4 15.5 16.7 36.1 (CONTRACTS OF 5,000 BUSHELS) WHEAT-SRW - CHICAGO BOARD OF TRADE 121056 112747 22502 111727 105822 20098 88722 6159 109074 25816 20362 5454 40024 32342 7682 363718 313881 49837 29.6 29.3 31.5 28.3 27.7 31.9 33.3 35.9 45.2 30.7 33.7 40.3 30.0 28.3 12.4 7.1 6.5 10.9 11.0 10.3 15.4 92.9 93.5 89.1 89.0 89.7 84.6 100 100 100 2019-09-10 337902 293519 44383 323694 281539 42155 80 72 34 92 87 49 101 104 35 115 109 41 103 118 20 346 338 145 252 232 83 262 247 99
1 190903 1 001602 CBT 0 -703.0 -15572.0 -3482.0 13336.0 -702.0 337.0 -1612.0 -4550.0 -4887.0 -2938.0 111840 97821 14019 98001 82374 15627 13.2 13.7 32.6 11.9 13.0 25.9 21.9 22.9 45.3 20.1 22.2 37.7 12.1 12.2 32.6 9.9 10.4 22.2 18.5 19.0 44.3 16.1 17.7 33.7 (CONTRACTS OF 5,000 BUSHELS) WHEAT-SRW - CHICAGO BOARD OF TRADE 114569 103964 22404 114633 108323 18109 83004 5991 100794 26760 21199 5561 40535 32287 8248 353963 305988 47975 31.6 32.0 29.2 27.7 26.9 32.6 32.4 34.0 46.7 32.4 35.4 37.7 28.5 27.1 12.5 7.6 6.9 11.6 11.5 10.6 17.2 92.4 93.1 88.4 88.5 89.4 82.8 100 100 100 2019-09-03 327203 284789 42414 313428 273701 39727 81 74 35 88 84 50 90 94 36 128 123 39 95 110 20 345 338 143 243 222 83 261 250 98
2 190827 1 001602 CBT 0 -18756.0 -10204.0 5094.0 -3903.0 -13782.0 -1379.0 -934.0 -28823.0 -27444.0 -27889.0 112543 101309 11234 113573 98886 14687 12.8 13.1 32.9 12.5 14.3 25.9 20.6 21.8 47.3 21.1 23.2 37.5 11.4 11.6 32.1 9.9 11.2 22.1 17.8 18.7 45.1 16.5 18.4 31.3 (CONTRACTS OF 5,000 BUSHELS) WHEAT-SRW - CHICAGO BOARD OF TRADE 118051 108685 22347 101297 97801 16477 81990 6525 101496 26423 20736 5687 42147 34043 8104 358513 312720 45793 31.4 32.4 24.5 31.7 31.6 32.1 32.9 34.8 48.8 28.3 31.3 36.0 28.3 26.2 14.2 7.4 6.6 12.4 11.8 10.9 17.7 92.6 93.4 87.6 88.2 89.1 82.3 100 100 100 2019-08-27 332090 291984 40106 316366 278677 37689 85 81 30 96 94 51 99 104 35 110 106 38 103 116 20 341 336 139 252 238 76 264 255 99
3 190820 1 001602 CBT 0 8679.0 -1358.0 -5449.0 3109.0 -361.0 -1090.0 389.0 1779.0 2869.0 1390.0 131299 119922 11377 123777 107310 16467 12.2 12.5 30.0 11.0 12.1 26.1 19.9 20.6 46.2 19.7 21.2 37.7 10.0 9.8 29.7 8.4 9.4 22.3 15.8 16.4 43.7 13.9 15.5 32.0 (CONTRACTS OF 5,000 BUSHELS) WHEAT-SRW - CHICAGO BOARD OF TRADE 112957 104967 21051 105200 104347 13914 96015 6202 115278 27802 22317 5485 43081 35549 7532 387336 343221 44115 33.9 34.9 25.8 32.0 31.3 37.3 29.2 30.6 47.7 27.2 30.4 31.5 29.8 28.0 14.1 7.2 6.5 12.4 11.1 10.4 17.1 92.8 93.5 87.6 88.9 89.6 82.9 100 100 100 2019-08-20 359534 320904 38630 344255 307672 36583 95 93 31 98 98 55 98 102 35 113 113 40 118 127 19 350 348 144 271 261 78 273 270 102
4 190813 1 001602 CBT 0 -13926.0 -18764.0 -2482.0 2663.0 4055.0 -1910.0 -2217.0 -14263.0 -12353.0 -12046.0 122620 112079 10541 125135 107483 17652 11.2 11.2 30.6 11.7 12.5 25.1 19.1 19.6 44.8 19.5 21.0 38.1 10.4 10.3 30.0 8.5 10.1 22.9 16.0 16.7 42.2 14.4 16.2 33.3 (CONTRACTS OF 5,000 BUSHELS) WHEAT-SRW - CHICAGO BOARD OF TRADE 118406 110864 21133 102091 103554 12128 96048 6000 115639 28892 23513 5379 42692 35419 7273 385557 342504 43053 31.8 32.7 24.5 32.5 31.4 41.0 30.7 32.4 49.1 26.5 30.2 28.2 30.0 28.0 13.9 7.5 6.9 12.5 11.1 10.3 16.9 92.5 93.1 87.5 88.9 89.7 83.1 100 100 100 2019-08-13 356665 318991 37674 342865 307085 35780 85 83 31 108 106 56 102 107 33 111 108 35 119 127 19 355 352 139 261 252 75 285 279 99
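To pick only specific years out of the collected links before downloading anything, the year parsing from the loop above can be reused as a filter. This is a minimal sketch: the helper names and the example.com URLs are made up, and it assumes each link ends in "_<year>.zip" as the code above does.

```python
def year_of(url):
    """Return the year at the end of a link like '.../name_2019.zip', or None."""
    try:
        return int(url.split('_')[-1].split('.')[0])
    except ValueError:
        return None  # link does not end in a plain year

def select_years(urls, first, last):
    """Keep only the links whose year lies in [first, last]."""
    selected = []
    for url in urls:
        year = year_of(url)
        if year is not None and first <= year <= last:
            selected.append(url)
    return selected

# Hypothetical links, shaped like the entries of files_list
files_list = [
    'https://example.com/files/fut_xls_2019.zip',
    'https://example.com/files/fut_xls_2018.zip',
    'https://example.com/files/fut_xls_2017.zip',
    'https://example.com/files/readme.zip',
]
wanted = select_years(files_list, 2018, 2019)
print(wanted)
```

The download loop can then iterate over wanted instead of files_list[:-1].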
I'm new to Python, so maybe I'm missing something really basic, but I can't figure it out. For my work I'm trying to read a txt file and apply KNN to it.
The file content is as follows. It has three columns, the third being the class, and the separator is a space.
0.85 17.45 2
0.75 15.6 2
3.3 15.45 2
5.25 14.2 2
4.9 15.65 2
5.35 15.85 2
5.1 17.9 2
4.6 18.25 2
4.05 18.75 2
3.4 19.7 2
2.9 21.15 2
3.1 21.85 2
3.9 21.85 2
4.4 20.05 2
7.2 14.5 2
7.65 16.5 2
7.1 18.65 2
7.05 19.9 2
5.85 20.55 2
5.5 21.8 2
6.55 21.8 2
6.05 22.3 2
5.2 23.4 2
4.55 23.9 2
5.1 24.4 2
8.1 26.35 2
10.15 27.7 2
9.75 25.5 2
9.2 21.1 2
11.2 22.8 2
12.6 23.1 2
13.25 23.5 2
11.65 26.85 2
12.45 27.55 2
13.3 27.85 2
13.7 27.75 2
14.15 26.9 2
14.05 26.55 2
15.15 24.2 2
15.2 24.75 2
12.2 20.9 2
12.15 21.45 2
12.75 22.05 2
13.15 21.85 2
13.75 22 2
13.95 22.7 2
14.4 22.65 2
14.2 22.15 2
14.1 21.75 2
14.05 21.4 2
17.2 24.8 2
17.7 24.85 2
17.55 25.2 2
17 26.85 2
16.55 27.1 2
19.15 25.35 2
18.8 24.7 2
21.4 25.85 2
15.8 21.35 2
16.6 21.15 2
17.45 20.75 2
18 20.95 2
18.25 20.2 2
18 22.3 2
18.6 22.25 2
19.2 21.95 2
19.45 22.1 2
20.1 21.6 2
20.1 20.9 2
19.9 20.35 2
19.45 19.05 2
19.25 18.7 2
21.3 22.3 2
22.9 23.65 2
23.15 24.1 2
24.25 22.85 2
22.05 20.25 2
20.95 18.25 2
21.65 17.25 2
21.55 16.7 2
21.6 16.3 2
21.5 15.5 2
22.4 16.5 2
22.25 18.1 2
23.15 19.05 2
23.5 19.8 2
23.75 20.2 2
25.15 19.8 2
25.5 19.45 2
23 18 2
23.95 17.75 2
25.9 17.55 2
27.65 15.65 2
23.1 14.6 2
23.5 15.2 2
24.05 14.9 2
24.5 14.7 2
14.15 17.35 1
14.3 16.8 1
14.3 15.75 1
14.75 15.1 1
15.35 15.5 1
15.95 16.45 1
16.5 17.05 1
17.35 17.05 1
17.15 16.3 1
16.65 16.1 1
16.5 15.15 1
16.25 14.95 1
16 14.25 1
15.9 13.2 1
15.15 12.05 1
15.2 11.7 1
17 15.65 1
16.9 15.35 1
17.35 15.45 1
17.15 15.1 1
17.3 14.9 1
17.7 15 1
17 14.6 1
16.85 14.3 1
16.6 14.05 1
17.1 14 1
17.45 14.15 1
17.8 14.2 1
17.6 13.85 1
17.2 13.5 1
17.25 13.15 1
17.1 12.75 1
16.95 12.35 1
16.5 12.2 1
16.25 12.5 1
16.05 11.9 1
16.65 10.9 1
16.7 11.4 1
16.95 11.25 1
17.3 11.2 1
18.05 11.9 1
18.6 12.5 1
18.9 12.05 1
18.7 11.25 1
17.95 10.9 1
18.4 10.05 1
17.45 10.4 1
17.6 10.15 1
17.7 9.85 1
17.3 9.7 1
16.95 9.7 1
16.75 9.65 1
19.8 9.95 1
19.1 9.55 1
17.5 8.3 1
17.55 8.1 1
17.85 7.55 1
18.2 8.35 1
19.3 9.1 1
19.4 8.85 1
19.05 8.85 1
18.9 8.5 1
18.6 7.85 1
18.7 7.65 1
19.35 8.2 1
19.95 8.3 1
20 8.9 1
20.3 8.9 1
20.55 8.8 1
18.35 6.95 1
18.65 6.9 1
19.3 7 1
19.1 6.85 1
19.15 6.65 1
21.2 8.8 1
21.4 8.8 1
21.1 8 1
20.4 7 1
20.5 6.35 1
20.1 6.05 1
20.45 5.15 1
20.95 5.55 1
20.95 6.2 1
20.9 6.6 1
21.05 7 1
21.85 8.5 1
21.9 8.2 1
22.3 7.7 1
21.85 6.65 1
21.3 5.05 1
22.6 6.7 1
22.5 6.15 1
23.65 7.2 1
24.1 7 1
21.95 4.8 1
22.15 5.05 1
22.45 5.3 1
22.45 4.9 1
22.7 5.5 1
23 5.6 1
23.2 5.3 1
23.45 5.95 1
23.75 5.95 1
24.45 6.15 1
24.6 6.45 1
25.2 6.55 1
26.05 6.4 1
25.3 5.75 1
24.35 5.35 1
23.3 4.9 1
22.95 4.75 1
22.4 4.55 1
22.8 4.1 1
22.9 4 1
23.25 3.85 1
23.45 3.6 1
23.55 4.2 1
23.8 3.65 1
23.8 4.75 1
24.2 4 1
24.55 4 1
24.7 3.85 1
24.7 4.3 1
24.9 4.75 1
26.4 5.7 1
27.15 5.95 1
27.3 5.45 1
27.5 5.45 1
27.55 5.1 1
26.85 4.95 1
26.6 4.9 1
26.85 4.4 1
26.2 4.4 1
26 4.25 1
25.15 4.1 1
25.6 3.9 1
25.85 3.6 1
24.95 3.35 1
25.1 3.25 1
25.45 3.15 1
26.85 2.95 1
27.15 3.15 1
27.2 3 1
27.95 3.25 1
27.95 3.5 1
28.8 4.05 1
28.8 4.7 1
28.75 5.45 1
28.6 5.75 1
29.25 6.3 1
30 6.55 1
30.6 3.4 1
30.05 3.45 1
29.75 3.45 1
29.2 4 1
29.45 4.05 1
29.05 4.55 1
29.4 4.85 1
29.5 4.7 1
29.9 4.45 1
30.75 4.45 1
30.4 4.05 1
30.8 3.95 1
31.05 3.95 1
30.9 5.2 1
30.65 5.85 1
30.7 6.15 1
31.5 6.25 1
31.65 6.55 1
32 7 1
32.5 7.95 1
33.35 7.45 1
32.6 6.95 1
32.65 6.6 1
32.55 6.35 1
32.35 6.1 1
32.55 5.8 1
32.2 5.05 1
32.35 4.25 1
32.9 4.15 1
32.7 4.6 1
32.75 4.85 1
34.1 4.6 1
34.1 5 1
33.6 5.25 1
33.35 5.65 1
33.75 5.95 1
33.4 6.2 1
34.45 5.8 1
34.65 5.65 1
34.65 6.25 1
35.25 6.25 1
34.35 6.8 1
34.1 7.15 1
34.45 7.3 1
34.7 7.2 1
34.85 7 1
34.35 7.75 1
34.55 7.85 1
35.05 8 1
35.5 8.05 1
35.8 7.1 1
36.6 6.7 1
36.75 7.25 1
36.5 7.4 1
35.95 7.9 1
36.1 8.1 1
36.15 8.4 1
37.6 7.35 1
37.9 7.65 1
29.15 4.4 1
34.9 9 1
35.3 9.4 1
35.9 9.35 1
36 9.65 1
35.75 10 1
36.7 9.15 1
36.6 9.8 1
36.9 9.75 1
37.25 10.15 1
36.4 10.15 1
36.3 10.7 1
36.75 10.85 1
38.15 9.7 1
38.4 9.45 1
38.35 10.5 1
37.7 10.8 1
37.45 11.15 1
37.35 11.4 1
37 11.75 1
36.8 12.2 1
37.15 12.55 1
37.25 12.15 1
37.65 11.95 1
37.95 11.85 1
38.6 11.75 1
38.5 12.2 1
38 12.95 1
37.3 13 1
37.5 13.4 1
37.85 14.5 1
38.3 14.6 1
38.05 14.45 1
38.35 14.35 1
38.5 14.25 1
39.3 14.2 1
39 13.2 1
38.95 12.9 1
39.2 12.35 1
39.5 11.8 1
39.55 12.3 1
39.75 12.75 1
40.2 12.8 1
40.4 12.05 1
40.45 12.5 1
40.55 13.15 1
40.45 14.5 1
40.2 14.8 1
40.65 14.9 1
40.6 15.25 1
41.3 15.3 1
40.95 15.7 1
41.25 16.8 1
40.95 17.05 1
40.7 16.45 1
40.45 16.3 1
39.9 16.2 1
39.65 16.2 1
39.25 15.5 1
38.85 15.5 1
38.3 16.5 1
38.75 16.85 1
39 16.6 1
38.25 17.35 1
39.5 16.95 1
39.9 17.05 1
My Code:
import csv
import random
import math
import operator
def loadDataset(filename, split, trainingSet=[] , testSet=[]):
    with open(filename, 'rb') as csvfile:
        lines = csv.reader(csvfile)
        dataset = list(lines)
        for x in range(len(dataset)-1):
            for y in range(3):
                dataset[x][y] = float(dataset[x][y])
            if random.random() < split:
                trainingSet.append(dataset[x])
            else:
                testSet.append(dataset[x])
def euclideanDistance(instance1, instance2, length):
    distance = 0
    for x in range(length):
        distance += pow((instance1[x] - instance2[x]), 2)
    return math.sqrt(distance)
def getNeighbors(trainingSet, testInstance, k):
    distances = []
    length = len(testInstance)-1
    for x in range(len(trainingSet)):
        dist = euclideanDistance(testInstance, trainingSet[x], length)
        distances.append((trainingSet[x], dist))
    distances.sort(key=operator.itemgetter(1))
    neighbors = []
    for x in range(k):
        neighbors.append(distances[x][0])
    return neighbors
def getResponse(neighbors):
    classVotes = {}
    for x in range(len(neighbors)):
        response = neighbors[x][-1]
        if response in classVotes:
            classVotes[response] += 1
        else:
            classVotes[response] = 1
    sortedVotes = sorted(classVotes.iteritems(), key=operator.itemgetter(1), reverse=True)
    return sortedVotes[0][0]
def getAccuracy(testSet, predictions):
    correct = 0
    for x in range(len(testSet)):
        if testSet[x][-1] == predictions[x]:
            correct += 1
    return (correct/float(len(testSet))) * 100.0
def main():
    # prepare data
    trainingSet=[]
    testSet=[]
    split = 0.67
    loadDataset('Jain.txt', split, trainingSet, testSet)
    print 'Train set: ' + repr(len(trainingSet))
    print 'Test set: ' + repr(len(testSet))
    # generate predictions
    predictions=[]
    k = 3
    for x in range(len(testSet)):
        neighbors = getNeighbors(trainingSet, testSet[x], k)
        result = getResponse(neighbors)
        predictions.append(result)
        print('> predicted=' + repr(result) + ', actual=' + repr(testSet[x][-1]))
    accuracy = getAccuracy(testSet, predictions)
    print('Accuracy: ' + repr(accuracy) + '%')
main()
Here:
lines = csv.reader(csvfile)
You have to tell csv.reader which separator to use; otherwise it uses the default Excel ',' separator. Note that in the example you posted, the separator might not actually be a single space but a tab ("\t" in Python) or a variable number of spaces, in which case it's not a csv-like format and you'll have to parse the lines yourself.
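If the separator is exactly one space, csv.reader(csvfile, delimiter=' ') should be enough. For the case where it is a tab or a variable number of spaces, here is a minimal sketch of parsing the lines by hand with str.split(). The function name parse_rows and the sample text are made up for illustration; the sample deliberately mixes spaces, a tab and a blank line.

```python
import io

def parse_rows(lines):
    """Turn whitespace-separated text lines into tuples of floats."""
    rows = []
    for line in lines:
        fields = line.split()   # no argument: split on any run of spaces or tabs
        if not fields:
            continue            # skip blank lines
        rows.append(tuple(float(field) for field in fields))
    return rows

# Hypothetical sample standing in for the opened file
sample = io.StringIO("0.85 17.45 2\n0.75\t15.6 2\n\n3.3   15.45 2\n")
rows = parse_rows(sample)
print(rows)
```

An open file object can be passed in place of sample, since iterating over it also yields one line at a time.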
Also, your code is far from Pythonic. First things first: Python's 'for' loops are really "for each" loops, i.e. they directly yield values from the object you iterate over. The proper way to iterate over a list is:
lst = ["a", "b", "c"]
for item in lst:
    print(item)
so no need for range() and indexed access here. Note that if you want to have the index too, you can use enumerate(sequence), which will yield (index, item) pairs, ie:
lst = ["a", "b", "c"]
for index, item in enumerate(lst):
    print("item at {} is {}".format(index, item))
So your loadDataset() function could be rewritten as:
def loadDataset(filename, split, trainingSet=None , testSet=None):
    # fix the mutable default argument gotcha
    # cf https://docs.python-guide.org/writing/gotchas/#mutable-default-arguments
    if trainingSet is None:
        trainingSet = []
    if testSet is None:
        testSet = []
    with open(filename, 'rb') as csvfile:
        reader = csv.reader(csvfile, delimiter="\t")
        for row in reader:
            row = tuple(float(x) for x in row)
            if random.random() < split:
                trainingSet.append(row)
            else:
                testSet.append(row)
    # so the caller can get the values back
    return trainingSet, testSet
Note that if any value in your file is not a proper representation of a float, you'll still get a ValueError in row = tuple(float(x) for x in row). The solution here is to catch the error and handle it one way or another: either re-raise it with additional debugging info (which value is wrong and which line of the file it belongs to), or log the error and ignore the row, or whatever makes sense in the context of your app / lib:
for row in reader:
    try:
        row = tuple(float(x) for x in row)
    except ValueError as e:
        # here we choose to just log the error
        # and ignore the row, but you may want
        # to do otherwise, your choice...
        print("wrong value in line {}: {}".format(reader.line_num, row))
        continue
    if random.random() < split:
        trainingSet.append(row)
    else:
        testSet.append(row)
Also, if you want to iterate over two lists in parallel (get 'list1[x], list2[x]' pairs), you can use zip():
lst1 = ["a", "b", "c"]
lst2 = ["x", "y", "z"]
for pair in zip(lst1, lst2):
    print(pair)
and there are functions to sum() values from an iterable, ie:
lst = [1, 2, 3]
print(sum(lst))
so your euclideanDistance function can be rewritten as:
def euclideanDistance(instance1, instance2, length):
    pairs = zip(instance1[:length], instance2[:length])
    # pow() needs the exponent as its second argument
    return math.sqrt(sum(pow(x - y, 2) for x, y in pairs))
etc etc...
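In the same spirit, here is a sketch of how the vote counting and the accuracy check could look. This is one possible rewrite, not the only one: collections.Counter replaces the hand-written dictionary (and the Python 2 only iteritems() call), and zip() replaces the indexed loop. The snake_case names and the sample neighbors are made up for the example.

```python
from collections import Counter

def get_response(neighbors):
    """Return the most common class label; the label is the last item of each row."""
    votes = Counter(row[-1] for row in neighbors)
    return votes.most_common(1)[0][0]

def get_accuracy(test_set, predictions):
    """Percentage of rows whose label equals the corresponding prediction."""
    correct = sum(1 for row, predicted in zip(test_set, predictions)
                  if row[-1] == predicted)
    return correct / float(len(test_set)) * 100.0

# Hypothetical neighbors: two rows of class 1.0 and one row of class 2.0
neighbors = [(14.15, 17.35, 1.0), (14.3, 16.8, 1.0), (13.75, 22.0, 2.0)]
print(get_response(neighbors))
print(get_accuracy([(0.0, 0.0, 1.0), (1.0, 1.0, 2.0)], [1.0, 1.0]))
```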
I asked about this yesterday, and someone gave me a great answer.
But I need to ask one more question.
[
Average monthly temperatures in Dubuque, Iowa,
January 1964 through december 1975, n=144
24.7 25.7 30.6 47.5 62.9 68.5 73.7 67.9 61.1 48.5 39.6 20.0
16.1 19.1 24.2 45.4 61.3 66.5 72.1 68.4 60.2 50.9 37.4 31.1
10.4 21.6 37.4 44.7 53.2 68.0 73.7 68.2 60.7 50.2 37.2 24.6
21.5 14.7 35.0 48.3 54.0 68.2 69.6 65.7 60.8 49.1 33.2 26.0
19.1 20.6 40.2 50.0 55.3 67.7 70.7 70.3 60.6 50.7 35.8 20.7
14.0 24.1 29.4 46.6 58.6 62.2 72.1 71.7 61.9 47.6 34.2 20.4
8.4 19.0 31.4 48.7 61.6 68.1 72.2 70.6 62.5 52.7 36.7 23.8
11.2 20.0 29.6 47.7 55.8 73.2 68.0 67.1 64.9 57.1 37.6 27.7
13.4 17.2 30.8 43.7 62.3 66.4 70.2 71.6 62.1 46.0 32.7 17.3
22.5 25.7 42.3 45.2 55.5 68.9 72.3 72.3 62.5 55.6 38.0 20.4
17.6 20.5 34.2 49.2 54.8 63.8 74.0 67.1 57.7 50.8 36.8 25.5
20.4 19.6 24.6 41.3 61.8 68.5 72.0 71.1 57.3 52.5 40.6 26.2
]
That's what I got from the website, and I used this:
for line in mystr.split('\n'):
    if not line:
        continue
    print (line.split()[3])
When I use this, I get the fourth value in every line.
That's almost what I want, but when I print it, I also get "in" and "december".
How can I get rid of these two words?
Skip the first two lines.
text = iter(mystr.split('\n'))
next(text)
next(text)
for line in text:
    ...
...
for line in itertools.islice(mystr.split('\n'), 2, None):
    ...
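A runnable sketch of the islice variant. It assumes mystr starts directly with the two header lines; if your string begins with an extra line such as "[", skip three lines instead of two. The shortened sample text and the name fourth_values are made up for the example.

```python
import itertools

# Shortened sample: two header lines followed by data lines
mystr = """Average monthly temperatures in Dubuque, Iowa,
January 1964 through december 1975, n=144
24.7 25.7 30.6 47.5 62.9 68.5
16.1 19.1 24.2 45.4 61.3 66.5"""

fourth_values = []
# islice(iterable, 2, None) yields everything from the third item onwards
for line in itertools.islice(mystr.split('\n'), 2, None):
    if not line:
        continue
    fourth_values.append(float(line.split()[3]))

print(fourth_values)  # [47.5, 45.4]
```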
Converting something that should be a float but isn't raises a ValueError, so try the following:
for line in mystr.split('\n'):
    if not line:
        continue
    try:
        print (float(line.split()[3]))
    except ValueError:
        pass
Replace print (line.split()[3]) with:
if line.split()[3] not in ['in', 'december']:
    print (line.split()[3])
or, more generic:
value = line.split()[3]
try:
    value = float(value)
    print(value)
except ValueError:
    pass
Generators work well in such cases, and you can use try: ... except: ... inside them. My take would be:
txt = """[
Average monthly temperatures in Dubuque, Iowa,
January 1964 through december 1975, n=144
24.7 25.7 30.6 47.5 62.9 68.5
16.1 19.1 24.2 45.4 61.3 66.5
10.4 21.6 37.4 44.7 53.2 68.0"""
def my_numbers(txt):
    for line in txt.splitlines():
        try:
            yield float(line.split()[3])
        except (ValueError, IndexError):
            # if conversion fails or not enough tokens in line
            continue
result = list(my_numbers(txt))
print(result)  # output: [47.5, 45.4, 44.7]