Sorting csv data by headers but getting IndexError

Sorting csv data by headers but getting IndexError - python

It seems that my code fails when I try to set what headers/columns of data I want to use giving me an index error when trying to parse headers
import pandas as pd
import quandl
import math, datetime
import numpy as np
from sklearn import preprocessing , cross_validation, svm
from sklearn.linear_model import LinearRegression
import scipy
import matplotlib.pyplot as plt
from matplotlib import style
import pickle
style.use('ggplot')
df = pd.read_csv('convertcsv.csv',sep='\t')
df = np.array(df)
print(df)
df = df[['Open','High','Low','Close','Volume (BTC)']]
print("ok")
df['HL_PCT'] = (df['High'] - df['Close']) / df['Close'] * 100.0
df['PCT_change'] = (df['Close'] - df['Open']) / df['Open'] * 100.0
df = df[['Close','HL_PCT','PCT_change','Volume (BTC)']]
forecast_col = 'Close'
df.fillna(-999999, inplace=True)
forecast_out = int(math.ceil(0.01*len(df)))
df['label'] = df[forecast_col].shift(-forecast_out)
X = np.array(df.drop(['label'],1))
X = preprocessing.scale(X)
X_lately = X[-forecast_out:]
X = X[:-forecast_out:]
df.dropna(inplace=True)
y = np.array(df['label'])
X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y,
test_size=0.2)
clf = LinearRegression(n_jobs=-1)
clf.fit(X_train, y_train)
with open('linearregression.pickle','wb') as f:
pickle.dump(clf, f)
pickle_in = open('linearregression.pickle','rb')
clf =pickle.load(pickle_in)
accuracy = clf.score(X_test,y_test)
print(accuracy)
forecast_set = clf.predict(X_lately)
df['Forecast'] = np.nan
last_date = df.iloc[-1].name
last_unix = last_date.timestamp()
one_day = 86400
next_unix = last_unix + one_day
for i in forecast_set:
next_date = datetime.datetime.fromtimestamp(next_unix)
next_unix += one_day
df.loc[next_date] = [np.nan for _ in range(len(df.columns)-1)] + [i]
df['Close'].plot()
df['Forecast'].plot()
plt.legend(loc=4)
plt.xlabel('Date')
plt.ylabel('Price')
plt.pause(1)
plt.show()
print("we done?")`
...
I cant seem to figure out what I am doing wrong, it worked with the previous data set I was using, if it helps here is the format of the csv file that I was pulling from:
Timestamp,Open,High,Low,Close,Volume (BTC),Volume (Currency),Weighted Price
2017-09-30 00:00:00,4162.04,4177.63,4154.28,4176.08,114.81,478389.12,4166.96
2017-09-30 01:00:00,4170.84,4224.6,4170.84,4208.14,348.45,1463989.18,4201.4
I am not too experienced with this sort of stuff, and I tried to find other people with the same error but everyone was having a different sort of problem, I can include more data if it is needed.

You're converting your dataframe to a numpy array with df = np.array(df).
Don't expect a numpy array to function as a pandas dataframe.
Remove
df = np.array(df)
and you should be able to slice your matrix by column name with
df = df[['Open','High','Low','Close','Volume (BTC)']]

Related

TypeError: 'NoneType' object is not callable error on pandas Series

I get the NoneType Error at the temp_df.plot() line. Same happens with the autocprrelation plot as well as when fitting the model. As you can see I tried dropping na and there are no empty or missing values in the series. Thank you
import numpy as np
import pandas
import pandas as pd
from datetime import datetime
from pandas import to_datetime, DataFrame
from pandas.plotting import autocorrelation_plot
from matplotlib import pyplot
from statsmodels.tsa.arima.model import ARIMA
path ='C:\\Users\o.ojuko\OneDrive - apeiron GmbH\Data\\'
stundentlichDF = pd.read_excel(path+'Database.xlsx',sheet_name='StündlichDaily',parse_dates=['Date'])
#Exclude 1st 4 months of 2021
forecastDF = stundentlichDF.drop(stundentlichDF[(stundentlichDF['Calendar Year']==2021) & (stundentlichDF['Month'].isin(["Jan","Feb","Mar","Apr"]))].index)
#forecastDF = stundentlichDF.loc[~((stundentlichDF['Calendar Year'] == 2021) & (stundentlichDF['Month'].isin(["Jan","Feb","Mar","Apr"]))),:]
forecastDF = forecastDF.loc[~((forecastDF['Working Year'] == 2020) & (forecastDF['Month'] == "Dec")),:]
#Select data
for x in forecastDF['StoreName'].unique():
print(x)
for y in forecastDF['Weekday'].unique():
print(y)
for z in forecastDF['Stunde'].unique():
print(z)
temp_df = forecastDF.loc[((forecastDF['StoreName'] == x) & (forecastDF['Weekday'] == y) & (forecastDF['Stunde'] == z)),:]
#Forecasting
temp_df= temp_df[['Date','Aufträge']]
#temp_df['Date'] = pd.to_datetime(temp_df['Date'])
temp_df = pd.Series(temp_df['Aufträge'].values, index=temp_df['Date'])
print(temp_df)
temp_df =temp_df.dropna()
print(type(temp_df))
temp_df.plot()
autocorrelation_plot(temp_df)
df_log = np.log(temp_df)
pyplot.plot(df_log)
model = ARIMA(temp_df,order=(5,1,0))
model_fit = model.fit()
print(model_fit.summary())
residuals = DataFrame(model_fit.resid)
residuals.plot()
pyplot.show()
#density plot of residuals
residuals.plot(kind='kde')
pyplot.show()
print(residuals.describe())
a = "Pause"

Simple Linear Regression issue in Python

I have this data:
and I am trying to do a simple linear regression model on it.
Here is my code:
from sklearn.linear_model import LinearRegression
X = df[['Date']]
y = df['ACP Cleaning']
model = LinearRegression()
model.fit(X, y)
X_predict = [['2021-1-1']]
y_predict = model.predict(X_predict)
and this is my error:
ValueError: Unable to convert array of bytes/strings into decimal
numbers with dtype='numeric'

Linear Regression works with numbers, not strings.
You must pre-process your data in order to match the input of the model.
One way to do it is to parse the string and convert it to timestamp:
import datetime
def process_date(date_str):
d = datetime.datetime.strptime(date_str, '%Y-%m-%d')
return d.timestamp()
X = df[['Date']].apply(process_date)
The same must be done to the data you want to predict.
Update: If your dataset's datatype is correct, then the problem is with the data you are trying to use for prediction (you cannot predict a string).
The following is a complete working example. Pay close attention to the processing done to the X_predict variable.
import datetime
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
rng = pd.date_range('2015-02-24', periods=5, freq='3A')
df = pd.DataFrame({ 'Date': rng, 'Val' : np.random.randn(len(rng))})
print(df.head())
X = np.array(df['Date']).reshape(-1,1)
y = df['Val']
model = LinearRegression()
model.fit(X, y)
def process_date(date_str):
d = datetime.datetime.strptime(date_str, '%Y-%m-%d')
# return array
return [d.timestamp()]
X_predict = ['2021-1-1']
X_predict = list(map(process_date, X_predict))
y_predict = model.predict(X_predict)
y_predict
Returns:
Date Val
0 2015-12-31 -0.110503
1 2018-12-31 -0.621394
2 2021-12-31 -1.030068
3 2024-12-31 1.221146
4 2027-12-31 -0.327685
array([-2.6149628])
Update: I used your data to create a csv file:
Date,Val
1-1-2020, 90404.71
2-1-2020, 69904.71
...
And then I loaded with pandas. Everything looks good to me:
def process_date(date_str):
# the date format is month-day-year
d = datetime.datetime.strptime(date_str, '%m-%d-%Y')
return d.timestamp()
df = pd.read_csv("test.csv")
df['Date'] = df['Date'].apply(process_date)
df.head()
Output:
Date Val
0 1.577848e+09 90404.710
1 1.580526e+09 69904.710
2 1.583032e+09 98934.112
3 1.585710e+09 77084.430
4 1.588302e+09 35877.420
Extracting features:
# must reshape 'cause we have only one feature
X = df['Date'].to_numpy().reshape(-1,1)
y = df['Val'].to_numpy()
model = LinearRegression()
model.fit(X, y)
Predicting:
X_predict = ['1-1-2021', '2-1-2021']
X_predict = np.array(list(map(process_date, X_predict)))
X_predict = X_predict.reshape(-1, 1)
y_predict = model.predict(X_predict)
y_predict
Output:
array([55492.2660361 , 53516.12292932])
This is a good prediction. You can use matplotlib to plot your data and convince yourself:
import matplotlib.pyplot as plt
%matplotlib inline
plt.plot(df['Date'], df['Val'])
plt.show()

Linear Regression needs your arrays to be of numeric type, since you have dates that are stored as strings in your X array, Linear Regression won't work as you expect.
You can convert the X array to numeric type by counting the number of days since the beginning date. You can try something like this in your DataFrame:
df.Date = (df.Date - df.Date[0]).days
And then you can continue as you were doing.
I have assumed that the dates in your Date column are in the datetime format, else you would need to convert it first.

I'm having trouble rounding decimals while encoding in python

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.preprocessing import OrdinalEncoder
df = pd.read_csv("mushrooms.csv",index_col=False,header=None)
def n(target):
if target == 'p':
return 1
elif target == 'e':
return 0
df[0] = df[0].apply(n)
#manually encoding the targets
targets = df[0]
inputs = df[df.columns[1:]]
def test_train_split(mydf, inputs, tratio, target):
splitter = StratifiedShuffleSplit(n_splits = 1, test_size = tratio, random_state = 42)
train_index, test_index = next(splitter.split(inputs, target))
strat_train = mydf.iloc[train_index]
strat_test = mydf.iloc[test_index]
return strat_train, strat_test
def print_test_train_dfs(train_df, test_df, target_column = 'None'):
print("\nTraining data:")
train_df.info()
if target_column != 'None':
print(train_df[target_column].value_counts())
print('\nTest data:')
test_df.info()
if target_column != 'None':
print(test_df[target_column].value_counts())
traindf, testdf = test_train_split(df, inputs, 0.2, targets)
enc = OrdinalEncoder()
enc.fit(traindf)
df = enc.transform(testdf)
for i in range(len(df)):
for j in range(len(df[1])):
df[i][j].round(0)
df = pd.DataFrame.from_records(df)
print(df)
df always ends up with decimals like 1.0 instead of just 1 which is what I want.
The dataset I'm using is here
https://www.kaggle.com/uciml/mushroom-classification
I'll also add that after .transform, df is in more of an array than a dataframe

df.astype(int) should load as integer
Refer to this question for more information
Change data type of columns in Pandas

Leave one out cross validation Support vector machine

We were given some code for a support vector machine where we are supposed to implement leave one out cross validation. If I understand it correctly leave one out will create as many test sets as there are samples, which means that for a big data set the process will be costly and most likely take quite long to generate results.
I have tried to implement leave one out to the given svm code with only one iteration and with 773 data points in total. I expected it to take some time but as of 2 h later the code is still running without any result, which makes me believe that it might be stuck in some loop or something...
Is there any suggestion as to what might be wrong? I'm not getting any error code either.
The entire code is as following, with the leave one out part is in the last function at the bottom (executed in jupyter notebook online binder):
import scipy as sp
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import gseapy as gp
from gseapy.plot import gseaplot
import qvalue
from ipywidgets import interact, interact_manual
from ipywidgets import IntSlider, FloatSlider, Dropdown, Text
import sklearn as skl
from sklearn.svm import LinearSVC
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.model_selection import LeaveOneOut
from sklearn import svm
interact_enrich=interact_manual.options(manual_name="Enrichment analysis")
interact_plot=interact_manual.options(manual_name="Plot")
interact_calc=interact_manual.options(manual_name="Calculate tests")
interact_gen=interact_manual.options(manual_name="Initialize data")
interact_SVM=interact_manual.options(manual_name="Train SVM")
clinical_data = pd.read_csv('../data/brca_clin.tsv.gz', sep ='\t', index_col=2)
clinical_data = clinical_data.iloc[4:,1:]
expression_data = pd.read_csv('../data/brca.tsv.gz', sep ='\t', index_col=1)
expression_data = expression_data.iloc[:,2:].T
def split_data(clinical_df, expression_df, separator, cond1, cond2):
try:
group1 = clinical_df[separator] == cond1
index1 = clinical_df[group1].index
group2 = clinical_df[separator] == cond2
index2 = clinical_df[group2].index
except:
print('Clinical condition wrong')
expression1 = expression_df.loc[index1].dropna()
expression2 = expression_df.loc[index2].dropna()
expression = pd.concat([expression1, expression2])
X = expression.values
y = np.append(np.repeat(0, len(expression1)), np.repeat(1, len(expression2)))
display(pd.DataFrame([len(index1),len(index2)], columns = ['Number of points'], index = ['Group 1', 'Group 2']))
return X, y
def plot_pca_variance(X, scale=False, ncomp = 1):
if scale:
scaler = StandardScaler()
X = scaler.fit_transform(X)
pca = PCA()
pca.fit(X)
plt.rcParams["figure.figsize"] = (20,10)
sns.set(style='darkgrid', context='talk')
plt.plot(np.arange(1,len(pca.explained_variance_ratio_)+1),np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('Number of components')
plt.ylabel('Cumulative explained variance')
plt.vlines(ncomp, 0, plt.gca().get_ylim()[1], color='r', linestyles = 'dashed')
h = np.cumsum(pca.explained_variance_ratio_)[ncomp -1]
plt.hlines(h, 0, plt.gca().get_xlim()[1], color='r', linestyles = 'dashed')
plt.title(str(ncomp) + ' components, ' + str(round(h, 3)) + ' variance explained')
plt.show()
def reduce_data(X, n, scale=True):
if scale:
scaler = StandardScaler()
X = scaler.fit_transform(X)
pca = PCA(n_components=n)
Xr = pca.fit_transform(X)
return Xr
def interact_split_data(Criteria, Group_1, Group_2):
global BRCA_X, BRCA_y
BRCA_X, BRCA_y = split_data(clinical_data, expression_data, Criteria, Group_1, Group_2)
def interact_SVM_1(Rescale, Max_iterations):
max_iter = int(Max_iterations)
loo = LeaveOneOut()
ac_matrix_train, ac_matrix_test = np.array([]), np.array([])
for train_id, test_id in loo.split(BRCA_X, BRCA_y):
X_train, X_test, y_train, y_test = BRCA_X[train_id,:], BRCA_X[test_id,:], BRCA_y[train_id],BRCA_y[test_id]
clf = svm.LinearSVC(C=0.1,max_iter=100000).fit(X_train, y_train) # Train an SVM
y_train_pred = clf.predict(X_train)
ac_matrix_train = confusion_matrix(y_train, y_train_pred)
y_test_pred = clf.predict(X_test)
ac_matrix_test = confusion_matrix(y_test, y_test_pred)
display(pd.DataFrame(np.concatenate((ac_matrix_train,ac_matrix_test), axis =1), columns = ["predicted G1 (training)","predicted G2 (training)", "predicted G1 (test)","predicted G2 (test)"],index=["actual G1","actual G2"]))
interact_gen(interact_split_data, Criteria=Text('PR status by ihc'), Group_1 = Text('Positive'), Group_2=Text('Negative'))
interact_SVM(interact_SVM_1, Rescale = False, Max_iterations = Text('1')) ```

how to fix ''Found input variables with inconsistent numbers of samples: [219, 247]''

As title says when running the following code i get a trouble Found input variables with inconsistent numbers of samples: [219, 247], i have read that the problem should be on the np.array set for X and y, but i cannot address the problem because there is a price for every date so i dont get why it is happening, any help will be appreciated thanks!
import pandas as pd
import quandl, math, datetime
import numpy as np
from sklearn import preprocessing, svm, model_selection
from sklearn.linear_model import LinearRegression
import matplotlib as plt
from matplotlib import style
style.use('ggplot')
df = quandl.get("NASDAQOMX/XNDXT25NNR", authtoken='myapikey')
df = df[['Index Value','High','Low','Total Market Value']]
df['HL_PCT'] = (df['High'] - df['Low']) / df['Index Value'] * 100.0
df = df[['Low','High','HL_PCT']]
forecast_col = 'High'
df.fillna(-99999, inplace=True)
forecast_out = int(math.ceil(0.1*len(df)))
df['label'] = df[forecast_col].shift(-forecast_out)
df.dropna(inplace= True)
X = np.array(df.drop(['label'],1))
X = preprocessing.scale(X)
X_lately = X[-forecast_out:]
X = X[:-forecast_out]
y=np.array(df['label'])
#X= X[:-forecast_out+1]
df.dropna(inplace=True)
y= np.array(df['label'])
X_train, X_test, y_train, y_test= model_selection.train_test_split(X,
y,test_size=0.2)
clf= LinearRegression(n_jobs=-1)
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
forecast_set= clf.predict(X_lately)
print(forecast_set, accuracy, forecast_out)
df['Forecast'] = np.nan
last_data= df.iloc[-1].name
last_unix= last_date.timestamp()
one_day=86400
next_unix= last_unix + one_day
for i in forecast_set:
next_date= datetime.datetime.fromtimestamp(next_unix)
next_unix += one_day
df.loc[next_date]= [np.nan for _ in range(len(df.columns) -1)] +
[i]
df['High'].plot()
df['Forecast'].plot()
plt.legend(loc=4)
plt.xlabel('Date')
plt.ylabel('Price')
plt.show()
the expected result should be a plot of future price prediction for that ticker but besides that it is throwing that error 'Found input variables with inconsistent numbers of samples: [219, 247]'.

Your problem lies in these two lines extracted from your code:
X = X[:-forecast_out]
y= np.array(df['label'])
You're subsetting X, but leaving y "as it is".
You may check that shapes differ indeed by:
X.shape, y.shape
Change the last line to:
y= np.array(df[:-forecast_out]['label'])
and you're fine.
Note as well, instead of these repetitive lines:
y=np.array(df['label'])
#X= X[:-forecast_out+1]
df.dropna(inplace=True) # there is no na at this point
y= np.array(df['label'])
the following line (solution to your problem) is just enough:
y= np.array(df[:-forecast_out]['label'])

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Sorting csv data by headers but getting IndexError - python

You're converting your dataframe to a numpy array with df = np.array(df). Don't expect a numpy array to function as a pandas dataframe. Remove df = np.array(df) and you should be able to slice your matrix by column name with df = df[['Open','High','Low','Close','Volume (BTC)']]

Related

TypeError: 'NoneType' object is not callable error on pandas Series

Simple Linear Regression issue in Python

I'm having trouble rounding decimals while encoding in python

Leave one out cross validation Support vector machine

how to fix ''Found input variables with inconsistent numbers of samples: [219, 247]''

Categories

Resources