I began a Python course in linear and logistic regression, but I am encountering what is probably a stupid error. I have to work with this data frame:
http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv
And this is my code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
rwq = pd.read_csv('*filepath*/winequality-red.csv')
rows = len(rwq.index)
cols = rwq.shape[1]
When I print rows and cols, rows correctly prints 1599, but for some reason cols always equals 1 (when in fact there are 12).
I also tried 'len(rwq.columns)' and I still get 1.
Am I doing something wrong or is the problem with the file provided?
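If the file is the one linked above, the likely cause is the delimiter rather than your code: the UCI wine-quality CSVs separate fields with semicolons, not commas, so pandas reads each row as one wide column. A minimal sketch of the fix (keeping the filepath placeholder from the question):
rwq = pd.read_csv('*filepath*/winequality-red.csv', sep=';')  # the file is semicolon-delimited
print(rwq.shape)  # should now report (1599, 12)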
I have 2 CSV files, one called training_data and another called target_data. I've read both of them in; training data contains around 30 columns of data and target data has 1. I'm trying to correlate the one column in the target data with all the columns of the training data.
import pandas as pd
import tarfile
import numpy as np
import csv
#reading in the data
training_data = pd.read_csv(training_data_path)
training_target = pd.read_csv(training_targets_path)
%matplotlib inline
import matplotlib.pyplot as plt
#plotting histogram
training_data.hist(bins=60,figsize=(30,25))
#after reviewing the histograms it can be seen in the histogram of the average household sizes that around 50 counties have an AvgHousehold size of almost 0
#PctSomeCol18_24, PctEmployed16_Over, PctPrivateCoverageAlone all have missing data
display(training_data)
display(training_target)
TARGET_deathRate = training_target["TARGET_deathRate"]
corr_matrix=training_data.corr(training_target)
I've tried using the corr function but it is not working.
It is better to compute the correlation within a single dataset, so first of all you have to join these two datasets and then use the correlation function. For joining you can use concat, append, or join; I prefer join:
df = training_data.join(training_target) #joining datasets
corr_matrix=df.corr()['TARGET_deathRate']
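The result is a Series holding the correlation of TARGET_deathRate with every column of the joined frame. If it helps, a quick way to rank them (a usage sketch, assuming the join above succeeded):
print(corr_matrix.sort_values(ascending=False))  # strongest positive correlations first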
I have a pandas DataFrame:
import yfinance as yf
import numpy as np
from scipy.signal import argrelextrema
import matplotlib.pyplot as plt
import pandas as pd
n = 2
df = yf.Ticker("INFY.NS").history(period='400d', interval='1D')
df['max'] = df.iloc[argrelextrema(df['Close'].values, np.greater_equal, order=n)[0]]['Close']
print(df)
I have created a column named max which has values as shown in the screenshot. The screenshot is only for reference; sample data can be obtained by running the code above.
I want to compare the max values (the ones which are non-NaN) with each other, but only in the forward direction.
For example, 777.244202 would be compared with all later values of the "max" column that are higher than 777.244202, and I want to print those rows which form a .618 Fibonacci retracement with 777.244202.
Is there any simpler method in pandas that can do this?
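One possible sketch, under two explicit assumptions: "forward direction" means pairing each local maximum with every later, higher local maximum, and "having .618 Fibonacci retracement" means the ratio of the earlier peak to the later one is close to 0.618. Both the interpretation and the tolerance tol are assumptions, not something the question pins down:
peaks = df['max'].dropna()
tol = 0.01  # hypothetical tolerance for "close to 0.618"
for i, earlier in enumerate(peaks):
    later = peaks.iloc[i + 1:]          # forward direction only
    higher = later[later > earlier]     # only peaks above the current one
    matches = higher[np.isclose(earlier / higher, 0.618, atol=tol)]
    if not matches.empty:
        print(df.loc[matches.index])    # rows forming a ~0.618 ratio with the earlier peak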
I am completing an assignment for my Data Cleaning class where I must perform a PCA. You shouldn't need the data frame to help with this particular issue. I am a Python noob and I've spent the last 3 hours trying to figure this out on my own with no success. The data frame has been cleaned and anomalies are sorted. I'm trying to create a PCA.
Here is my code:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
from pylab import rcParams
from sklearn.decomposition import PCA
df_pcs = pd.read_csv('/Users/personal/Desktop/Data set/churn_raw_data.csv', usecols=['Churn','Age', 'Income', 'Outage_sec_perweek','Yearly_equip_failure', 'Tenure', 'MonthlyCharge'])
data_numeric = df_pcs[['Age', 'Income', 'Outage_sec_perweek','Yearly_equip_failure', 'Tenure', 'MonthlyCharge']]
data_normed = (df_pcs - df_pcs.mean()) / df_pcs.std()
pca = PCA(n_components=2)
pca.fit(data_normed)
data_pca = pd.DataFrame(pca.transform(data_normed),columns = data_numeric)
The error I'm getting: ValueError: Index data must be 1-dimensional
I'm not entirely sure how to make the index data 1 dimensional. Any help would be greatly appreciated.
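Assuming the traceback points at the pd.DataFrame(...) line, the columns argument expects a 1-dimensional sequence of column labels, but data_numeric is a whole DataFrame (2-dimensional), hence the error. A minimal sketch of a fix; it normalizes data_numeric rather than the full df_pcs (since 'Churn' is presumably non-numeric), and the labels 'PC1' and 'PC2' are just illustrative names for the two components:
# normalize only the numeric columns, then label the two components explicitly
data_normed = (data_numeric - data_numeric.mean()) / data_numeric.std()
pca = PCA(n_components=2)
data_pca = pd.DataFrame(pca.fit_transform(data_normed), columns=['PC1', 'PC2'])
Since n_components=2, the transformed array has two columns, so exactly two labels are needed.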
I am new to Python and I am trying to perform a spline interpolation. My data contains 3 columns, with a number of rows having NaN in one of the columns. I need to ignore/remove the NaN values without reducing the length. I have tried a number of ways, but each time the length is reduced. Any help or advice would be gratefully received.
import numpy as np
import pandas as pd
import scipy.linalg
import matplotlib.style
import math
data = pd.read_excel('prob_data.xlsx')
x = data['A'][~np.isnan(data['A'])]
print(len(x))
z = data['B'][~np.isnan(data['B'])]
print(len(z))
y = data['C'][~np.isnan(data['C'])]
print(len(y))
You can use the SimpleImputer class:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='median')  # replaces each NaN with the column median
data = pd.read_excel('prob_data.xlsx')
nice_data = pd.DataFrame(imputer.fit_transform(data), columns=data.columns)
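Since the question was specifically about spline interpolation, pandas can also fill the NaNs with a spline fit directly, which keeps the original length. A sketch, assuming a cubic spline (order=3) is an acceptable choice and scipy is installed:
data = pd.read_excel('prob_data.xlsx')
filled = data.interpolate(method='spline', order=3)  # fills interior NaNs with a cubic spline fit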
I'm using the Titanic data set to predict whether a passenger survived or not, using a random forest. This is my code:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn import cross_validation
import matplotlib.pyplot as plt
%matplotlib inline
data=pd.read_csv("C:\\Users\\kabala\\Downloads\\Titanic.csv")
data.isnull().any()
data["Age"]=data1["Age"].fillna(data1["Age"].median())
data["PClass"]=data["PClass"].fillna("3rd")
data["PClass"].isnull().any()
data.isnull().any()
pd.get_dummies(data.Sex)
# choosing the predictive variables
x=data[["PClass","Age","Sex"]]
# the target variable is y
y=data["Survived"]
modelrandom=RandomForestClassifier(max_depth=3)
modelrandom=cross_validation.cross_val_score(modelrandom,x,y,cv=5)
But, I keep on getting this error:
ValueError: could not convert string to float: 'female'
and I don't understand what the problem is, because I changed the Sex feature to a dummy variable.
Thanks:)
pd.get_dummies returns a new data frame and does not do the operation in place. Therefore you really are still sending a string in the Sex column.
So you would need something like X = pd.get_dummies(data[['Sex','PClass','Age']], columns=['Sex','PClass']), and this should fix your problem. I think PClass is also a string column for which you need dummy variables, since you fill it with '3rd'.
There are still some places where you call data.isnull().any() that do nothing to the underlying data frame. I left those as they were, but just FYI they may not be doing what you intended.
Full code would be:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score  # sklearn.cross_validation was removed in newer scikit-learn
import matplotlib.pyplot as plt
%matplotlib inline
data = pd.read_csv("C:\\Users\\kabala\\Downloads\\Titanic.csv")
data.isnull().any()  # <-- beware: this does not modify the data
data["Age"] = data["Age"].fillna(data["Age"].median())
data["PClass"] = data["PClass"].fillna("3rd")
data["PClass"].isnull().any()  # <-- beware: this does not modify the data
data.isnull().any()  # <-- beware: this does not modify the data
# ********Fix for your code*******
X = pd.get_dummies(data[['Sex','PClass','Age']], columns=['Sex','PClass'])
# choosing the predictive variables
# x=data[["PClass","Age","Sex"]]  # replaced by X above
# the target variable is y
y = data["Survived"]
modelrandom = RandomForestClassifier(max_depth=3)
scores = cross_val_score(modelrandom, X, y, cv=5)
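cross_val_score returns one accuracy score per fold, so a quick way to summarize the result would be:
print(scores.mean())  # average accuracy across the 5 folds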