How to create a DataFrame in Pandas - python

I am using playerStats.csv, which includes 8 columns, of which I only need 2. So I'm trying to create a new DataFrame with only those 2 columns.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
dataset = pd.read_csv("HLTVData/playerStats.csv")
dataset.head(20)
I only need the ADR and the Rating.
So I first create a matrix with the data set.
mat = dataset.as_matrix()
#4 is the ADR and 6 is the Rating
newDataSet = pd.DataFrame(dataset, index=indexMatrix, columns=(mat[:,4], mat[:,6]))
But it didn't work; it threw an exception:
NameError Traceback (most recent call last)
<ipython-input-10-1f975cc2514a> in <module>()
1 #4 is the ADR and 6 is the Rating
----> 2 newDataSet = pd.DataFrame(dataset, index=indexMatrix,columns=(mat[:,4],mat[:,6]) )
NameError: name 'indexMatrix' is not defined
I also tried using the dataset directly:
newDataSet = pd.DataFrame(dataset, index=np.array(range(dataset.shape[0])), columns=dataset['ADR'])
/home/tensor/miniconda3/envs/tensorflow35openvc/lib/python3.5/site-packages/pandas/core/internals.py in _make_na_block(self, placement, fill_value)
3984
3985 dtype, fill_value = infer_dtype_from_scalar(fill_value)
-> 3986 block_values = np.empty(block_shape, dtype=dtype)
3987 block_values.fill(fill_value)
3988 return make_block(block_values, placement=placement)
MemoryError:

I think you need the usecols parameter in read_csv:
dataset = pd.read_csv("HLTVData/playerStats.csv", usecols=['ADR','Rating'])
Or:
dataset = pd.read_csv("HLTVData/playerStats.csv", usecols=[4,6])


Error when trying to use seasonal_decompose from Statsmodels

I have the following dataframe called 'data':
Month        Revenue Index
1920-01-01   1.72
1920-02-01   1.83
1920-03-01   1.94
...          ...
2021-10-01   114.20
2021-11-01   115.94
2021-12-01   116.01
This is essentially a monthly revenue index on which I am trying to use seasonal_decompose with the following code:
result = seasonal_decompose(data['Revenue Index'], model='multiplicative')
But unfortunately I get the following error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-39-08e3139bbf77> in <module>()
----> 1 result = seasonal_decompose(data['Consumptieprijsindex'], model='multiplicative')
2 rcParams['figure.figsize'] = 12, 6
3 plt.rc('lines', linewidth=1, color='r')
4
5 fig = result.plot()
/usr/local/lib/python3.7/dist-packages/statsmodels/tsa/seasonal.py in seasonal_decompose(x, model, filt, freq, two_sided, extrapolate_trend)
125 freq = pfreq
126 else:
--> 127 raise ValueError("You must specify a freq or x must be a "
128 "pandas object with a timeseries index with "
129 "a freq not set to None")
ValueError: You must specify a freq or x must be a pandas object with a timeseries index with a freq not set to None
Does anyone know how to solve this issue? Thanks!
The following code in the comments answered my question:
result = seasonal_decompose(data['Revenue Index'], model='multiplicative', period=12)
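Alternatively, here is a sketch that fixes the root cause the traceback points at: give the series a DatetimeIndex with an explicit frequency, and statsmodels can infer the period on its own (this assumes Month is a regular column; if it is already the index, skip the set_index step):
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

data['Month'] = pd.to_datetime(data['Month'])
data = data.set_index('Month').asfreq('MS')  # month-start frequency

# With freq set on the index, no explicit period argument is needed
result = seasonal_decompose(data['Revenue Index'], model='multiplicative')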

creating a dataset function using scikit-learn

So I am pretty new to Python, and I am trying to load a dataset from my computer using scikit-learn. This is what my code looks like:
whatever.py:
import numpy as np
import csv
from sklearn.datasets.base import Bunch

class Cortex_nuc:
    def cortex_nuclear():
        with open('C:/Users/User/Desktop/Data_Cortex_Nuclear4.csv') as csv_file:
            data_file = csv.reader(csv_file)
            temp = next(data_file)
            n_samples = int(float(temp[0]))
            n_features = int(float(temp[1]))
            data = np.empty((n_samples, n_features))
            target = np.empty((n_samples,), dtype=np.float64)
            for i, sample in enumerate(data_file):
                data[i] = np.asarray(sample[:-1], dtype=np.float64)
                target[i] = np.asarray(sample[-1], dtype=np.float64)
        return Bunch(data=data, target=target)
Then I import it into my project:
from whatever import Cortex_nuc
and after that I try to save it into df:
df = Cortex_nuc.cortex_nuclear()
By the way, the screenshot I included (omitted here) shows just a part of the dataset; in full it has 77 columns and about a thousand rows.
But I get an error message and I can't seem to figure out why it's happening. Here's the error message:
IndexError Traceback (most recent call last)
<ipython-input-5-a4935f2c187f> in <module>
----> 1 df = Cortex_nuc.cortex_nuclear()
~\whatever.py in cortex_nuclear()
20
21 for i, sample in enumerate(data_file):
---> 22 data[i] = np.asarray(sample[:-1], dtype=np.float64)
23 target[i] = np.asarray(sample[-1], dtype=np.float64)
24
IndexError: index 0 is out of bounds for axis 0 with size 0
Can someone please help me? Thanks!
If you want to create a "sklearn-like" dataset in a Bunch object, you probably want something like this:
import pandas as pd
import numpy as np
from sklearn.utils import Bunch

# For reproducing
from io import StringIO
csv_file = StringIO("""
target,A,B
0,0,0
1,0,1
1,1,0
0,1,1
""")

def load_xor(*, return_X_y=False):
    """Describe your data here."""
    _data_file = pd.read_csv(csv_file)
    _data = Bunch()
    _data["DESCR"] = load_xor.__doc__
    _data["data"] = _data_file[["A", "B"]].to_numpy(dtype=np.float64)
    _data["target"] = _data_file["target"].to_numpy(dtype=np.float64)
    _data["target_names"] = np.array(["false", "true"])
    _data["feature_names"] = np.array(list(_data_file.drop(["target"], axis=1)))
    if return_X_y:
        return _data.data, _data.target
    return _data

if __name__ == "__main__":
    # Return and unpack the `X`, `y` tuple
    X, y = load_xor(return_X_y=True)
    print(X, y)
This is because the loaders in sklearn.datasets typically return Bunch objects with specific attributes/keys (for explanations, see the "Returns" section of the load_iris documentation):
>>> from sklearn.datasets import load_iris
>>> data = load_iris()
>>> dir(data)
['DESCR', 'data', 'feature_names', 'filename', 'frame', 'target', 'target_names']
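As a quick usage sketch (same assumed load_xor as above, run in a fresh session since the StringIO can only be consumed once), calling it without return_X_y gives the full Bunch:
data = load_xor()
print(data.DESCR)           # Describe your data here.
print(data.feature_names)   # ['A' 'B']
print(data.data.shape)      # (4, 2)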

How to fix these elements?

I am getting an error with some elements:
import pandas as pd
import numpy as np
from scipy.signal import argrelextrema
import matplotlib.pyplot as plt
import datetime
#Import our historical data
data = pd.read_csv('Data/sample.csv')
data.columns = [['Date','open','high','low','close','vol']]
data = data.drop_duplicates(keep=False)
data.Date = pd.to_datetime(data.Date,format='%Y.%m.%d %H:%M:%S.%f')
data = data.set_index(data.Date)
data = data[['open', 'high', 'close', 'vol']]
price = data.close.iloc[:100]
# Find our relative extrema
max_idx = argrelextrema(price.values,np.greater,order=1)
min_idx = argrelextrema(price.values,np.less,order=1)
print(max_idx)
print(min_idx)
The error is
Traceback (most recent call last):
File "untitled.py", line 9, in <module>
data.columns = [['Date','open','high','low','close','vol']]
ValueError: Length mismatch: Expected axis has 1 elements, new values have 6 elements
You want to pass a flat list, not a list of lists; a nested list makes pandas build a MultiIndex instead of six plain column names.
data.columns = ['Date','open','high','low','close','vol']
Edit 1: Your CSV file seems to be tab-separated:
data = pd.read_csv('Data/sample.csv', sep='\t')
data.columns = ['Date','open','high','low','close','vol']
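To see the fix end to end, here is a self-contained sketch with invented tab-separated rows standing in for Data/sample.csv:
import pandas as pd
from io import StringIO

# Hypothetical tab-separated rows standing in for Data/sample.csv
raw = StringIO(
    "2017.01.02 00:00:00.001\t1.0510\t1.0525\t1.0505\t1.0520\t100\n"
    "2017.01.02 00:01:00.001\t1.0520\t1.0530\t1.0510\t1.0515\t90\n"
)
data = pd.read_csv(raw, sep='\t', header=None)
print(data.shape)  # (2, 6): six columns, so the flat-list rename now matches
data.columns = ['Date', 'open', 'high', 'low', 'close', 'vol']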

Scipy in Jupyter

Can someone help me figure out why I'm getting this error: ValueError: n_components must be < n_features; got 10 >= 0
import pandas as pd
from scipy.sparse import csr_matrix
users = pd.read_table(open('ml-1m/users.dat', encoding = "ISO-8859-1"), sep=':', header=None, names=['user_id', 'gender', 'age', 'occupation', 'zip'])
ratings = pd.read_table(open('ml-1m/ratings.dat', encoding = "ISO-8859-1"), sep=':', header=None, names=['user_id', 'movie_id', 'rating', 'timestamp'])
movies = pd.read_table(open('ml-1m/movies.dat', encoding = "ISO-8859-1"), sep=':', header=None, names=['movie_id', 'title', 'genres'])
MovieLens = pd.merge(pd.merge(ratings, users), movies)
ratings_mtx_df = MovieLens.pivot_table(values='rating', index='user_id', columns='title', fill_value=0)
movie_index = ratings_mtx_df.columns
from sklearn.decomposition import TruncatedSVD
recom = TruncatedSVD(n_components=10, random_state=101)
R = recom.fit_transform(ratings_mtx_df.values.T)
ValueError Traceback (most recent call last)
<ipython-input-8-0bd6c9bda95a> in <module>()
1 from sklearn.decomposition import TruncatedSVD
2 recom = TruncatedSVD(n_components=10, random_state=101)
----> 3 R = recom.fit_transform(ratings_mtx_df.values.T)
C:\Users\renau\Anaconda3\lib\site-packages\sklearn\decomposition\truncated_svd.py in fit_transform(self, X, y)
168 if k >= n_features:
169 raise ValueError("n_components must be < n_features;"
--> 170 " got %d >= %d" % (k, n_features))
171 U, Sigma, VT = randomized_svd(X, self.n_components,
172 n_iter=self.n_iter,
ValueError: n_components must be < n_features; got 10 >= 0
You're trying to split your data into 10 dimensions, but as per the documentation for TruncatedSVD, the number of features (columns) in your ratings_mtx_df data needs to be greater than the number of dimensions/components you're looking to extract. Try n_components=3 (assuming you've got at least 3 features in your data) and see if that's any better.
Also, you're turning your input data sideways with the .T in:
R = recom.fit_transform(ratings_mtx_df.values.T)
That may result in switching features (columns) for observations (rows), which might explain why the fit_transform method isn't working.
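For instance, here is a sketch assuming ratings_mtx_df is the users-by-movie-titles pivot built above: keep users as the rows (drop the .T) and ask for fewer components than there are movies:
from sklearn.decomposition import TruncatedSVD

recom = TruncatedSVD(n_components=3, random_state=101)
R = recom.fit_transform(ratings_mtx_df.values)  # rows = users, columns = movies
print(R.shape)  # (n_users, 3)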

Can pandas groupby transform a DataFrame into a Series?

I would like to use pandas and statsmodels to fit a linear model on subsets of a dataframe and return the predicted values. However, I am having trouble figuring out the right pandas idiom to use. Here is what I am trying to do:
import pandas as pd
import statsmodels.formula.api as sm
import seaborn as sns

tips = sns.load_dataset("tips")

def fit_predict(df):
    m = sm.ols("tip ~ total_bill", df).fit()
    return pd.Series(m.predict(df), index=df.index)

tips["predicted_tip"] = tips.groupby("day").transform(fit_predict)
This raises the following error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-139-b3d2575e2def> in <module>()
----> 1 tips["predicted_tip"] = tips.groupby("day").transform(fit_predict)
/Users/mwaskom/anaconda/lib/python2.7/site-packages/pandas/core/groupby.pyc in transform(self, func, *args, **kwargs)
3033 return self._transform_general(func, *args, **kwargs)
3034 except:
-> 3035 return self._transform_general(func, *args, **kwargs)
3036
3037 # a reduction transform
/Users/mwaskom/anaconda/lib/python2.7/site-packages/pandas/core/groupby.pyc in _transform_general(self, func, *args, **kwargs)
2988 group.T.values[:] = res
2989 else:
-> 2990 group.values[:] = res
2991
2992 applied.append(group)
ValueError: could not broadcast input array from shape (62) into shape (62,6)
The error makes sense in that I think .transform wants to map a DataFrame to a DataFrame. But is there a way to do a groupby operation on a DataFrame, pass each chunk into a function that reduces it to a Series (with the same index), and then combine the resulting Series into something that can be inserted into the original dataframe?
The top part here is the same; I'm just using a toy dataset because I'm behind a firewall.
tips = pd.DataFrame({'day': list('MMMFFF'), 'tip': range(6),
                     'total_bill': [10, 40, 20, 80, 50, 40]})

def fit_predict(df):
    m = sm.ols("tip ~ total_bill", df).fit()
    return pd.Series(m.predict(df), index=df.index)
If you change 'transform' to 'apply', you'll get:
tips.groupby("day").apply(fit_predict)
day
F 3 2.923077
4 4.307692
5 4.769231
M 0 0.714286
1 1.357143
2 0.928571
That's not quite what you want, but if you drop level 0 of the resulting index, you can proceed as desired:
tips['predicted'] = tips.groupby("day").apply(fit_predict).reset_index(level=0,drop=True)
day tip total_bill predicted
0 M 0 10 0.714286
1 M 1 40 1.357143
2 M 2 20 0.928571
3 F 3 80 2.923077
4 F 4 50 4.307692
5 F 5 40 4.769231
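A slightly tidier variant, sketched on the assumption that your pandas version supports group_keys on groupby (it has for a long time): with group_keys=False, apply does not prepend the day level, so the reset_index step disappears:
# group_keys=False keeps apply from adding the 'day' level to the index,
# so the result aligns directly with tips' original index
tips['predicted'] = tips.groupby("day", group_keys=False).apply(fit_predict)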
