I am trying to load a statsmodels dataset as I saw in a tutorial, but I keep getting an error.
import statsmodels.api as sm
import pandas as pd
data = sm.datasets.co2.load_pandas()
co2 = data.data
co2['ds'] = co2.index
co2.rename(columns={'co2': 'y'}, inplace=True)
co2.tail()
This is the error I am getting:
TypeError: new() got an unexpected keyword argument 'format'
It looks like the problem is in the original load_pandas function: the format parameter no longer exists in newer versions of pd.DatetimeIndex. For details, please refer to https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.DatetimeIndex.html
def load_pandas():
    data = load()
    # pandas <= 0.12.0 fails in the to_datetime regex on Python 3
    index = pd.DatetimeIndex(start=data.data['date'][0].decode('utf-8'),
                             periods=len(data.data), format='%Y%m%d',
                             freq='W-SAT')
    dataset = pd.DataFrame(data.data['co2'], index=index, columns=['co2'])
    # NOTE: this is how I got the missing values in co2.csv
    # new_index = pd.DatetimeIndex(start='1958-3-29', end=index[-1],
    #                              freq='W-SAT')
    # data.data = dataset.reindex(new_index)
    data.data = dataset
    return data
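As an aside, in recent pandas the start/periods/freq arguments moved to pd.date_range, so an equivalent index could be built along these lines (a sketch of the replacement API only, not what statsmodels itself does):
index = pd.date_range(start='1958-3-29', periods=len(data.data), freq='W-SAT')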
So my workaround for this is below:
# load data into a pandas DataFrame
co2 = pd.DataFrame(sm.datasets.co2.load().data)
# convert bytes into string and then datetime
co2['date'] = pd.to_datetime(co2.date.apply(lambda x: x.decode("utf-8")))
# set the date as index
co2.set_index('date', inplace=True)
output:
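From there, the renaming step from the original snippet should work unchanged on this DataFrame (a sketch based on the code above):
co2['ds'] = co2.index
co2.rename(columns={'co2': 'y'}, inplace=True)
co2.tail()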
Related
Python has the scorecardpy library for scorecard development, which is an alternative to the R package scorecard,
but while running WOE binning (woebin) from scorecardpy as follows
bins = sc.woebin(df_temp, y=target_var, positive='bad|1')
# df_temp is a dataframe in which all columns have the 'float64' data type
# y is a binary variable with data type 'int'
I was getting an error for a few variables:
TypeError: unhashable type: 'numpy.ndarray'
while the others were getting binned properly.
I tried to find the differences between the variables where binning succeeded and those where it failed, based on a few attributes:
null values
dtypes
shape
in order to understand the pattern, with the following code:
unhashable_nparray = []
successfully_binned = []
for j, i in enumerate(df3.columns):
    try:
        print(i)
        df_temp = pd.DataFrame()
        df_temp[i] = df3[i]
        df_temp[target_var] = y
        bins = sc.woebin(df_temp, y=target_var, positive='bad|1')
        successfully_binned.append(i)
        print(i, "---{a}---{b}----{c}--{d}---{z}----Success".format(z=df3[i].nunique(), a=df3[i].shape, b=df3[i].isna().sum(), c=type(df3[i]), d=df3[i].dtypes))
    except TypeError:
        unhashable_nparray.append(i)
        print(i, "---{a}---{b}----{c}--{d}---{z}----Fail".format(z=df3[i].nunique(), a=df3[i].shape, b=df3[i].isna().sum(), c=type(df3[i]), d=df3[i].dtypes))
but I could not find any pattern.
What could be the cause of this?
Here is the solution.
This error has already been reported and is being discussed here, but unfortunately the discussion is in Chinese, so it is difficult to read.
This error doesn't necessarily mean that your dataframe is unclean or has an unexpected datatype.
Please check your pandas version:
import pandas as pd
pd.__version__
If the output is > 1.4.0, you are using a more recent version of pandas than scorecardpy expects.
As of today, scorecardpy expects you to work with pandas version <= 1.3.5,
so please install pandas == 1.3.5.
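For example, with pip:
pip install pandas==1.3.5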
I am trying to predict the stock price of Facebook at the 1664th row of the .csv file. I am encountering an error when appending to a np.array. Here's my code:
##predicts price of facebook stock for one day
from sklearn.svm import SVR
import numpy as np
import pandas as pd
##store and show data
df = pd.read_csv(r'fb.csv')
##get and print last row of data
actual_price = df.tail(1)
#print(actual_price)
##prepare and print svr models
##get all of the data except for last row
df = df.head(len(df)-1)
ind = (np.arange((len(df.index))))
df["index"] = ind
##create empty list to store dependent and independent data
# days1 =
days = np.array([])
adj_close_prices = np.array([])
##get the date and adjusted close prices
df_days = df.loc[:, 'index']
df_adj_close = df.loc[:, 'Adj Close']
##create the independent dataset ### this part to specify
for day in df_days:
    days = np.append(float(day))
And the error which keeps occurring is the following:
days = np.append(float(day))
File "<__array_function__ internals>", line 4, in append
TypeError: _append_dispatcher() missing 1 required positional argument: 'values'
I have very basic level of Python knowledge and have been using YouTube and online resources to come up with what I have already.
I haven't checked your whole code, but the append problem isn't too hard to fix. Check the numpy.append() documentation and you will notice that it takes two required parameters (and a third that is optional), in the form numpy.append(array_to_append_to, value_to_append). So, in your code it should look like days = np.append(days, float(day)).
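Applied to your loop, a minimal sketch would be (note that np.append returns a new array, so the result must be reassigned):
days = np.array([])
for day in df_days:
    days = np.append(days, float(day))
A more idiomatic alternative would be days = df_days.to_numpy(dtype=float), which builds the whole array in one step instead of growing it element by element.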
I'm working on a data frame taken from Adafruit IO, and sadly some of my data is from a time when my project malfunctioned, so some of the values are just NaN.
I tried to remove them with these lines of code:
onlyValidData=temp_data.mask(temp_data['value'] =='NaN')
onlyValidData
This is data retrieved from an Adafruit IO feed and analyzed with pandas; I tried using the 'where' function too, but it didn't work.
My entire code is:
import pandas as pd
temp_data = pd.read_json('https://io.adafruit.com/api/(...)')
light_data = pd.read_json('https://io.adafruit.com/api/(...)')
temp_data['created_at'] = pd.to_datetime(temp_data['created_at'], infer_datetime_format=True)
temp_data = temp_data.set_index('created_at')
light_data['created_at'] = pd.to_datetime(light_data['created_at'], infer_datetime_format=True)
light_data = light_data.set_index('created_at')
tempVals = pd.Series(temp_data['value'])
lightVals = pd.Series(light_data['value'])
onlyValidData=temp_data.mask(temp_data['value'] =='NaN')
onlyValidData
The output is all of my data for some reason, but it should be only the valid values.
Hey, I think the issue here is that you're looking for values equal to the string 'NaN', while actual NaN values aren't strings, or more specifically aren't anything (they are true missing values).
Try using:
onlyValidData = temp_data.mask(temp_data['value'].isnull())
Edit: to remove rows rather than marking all values in that row as NaN:
onlyValidData = temp_data.dropna()
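If you only want to check the 'value' column for missing data (rather than any column in the row), dropna's subset parameter can be used; a sketch:
onlyValidData = temp_data.dropna(subset=['value'])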
I want to prepare a dataset from the data available at http://stat.data.abs.gov.au/Index.aspx?DataSetCode=ATSI_BIRTHS_SUMM
Data API:
http://stat.data.abs.gov.au/restsdmx/sdmx.ashx/GetData/ATSI_BIRTHS_SUMM/1+4+5+7+8+9+10+13+14+15+18+19+20.IM+IB.0+1+2+3+4+5+6+7.A/all
from pandasdmx import Request
Agency_Code = 'ABS'
Dataset_Id = 'ATSI_BIRTHS_SUMM'
ABS = Request(Agency_Code)
data_response = ABS.data(resource_id='ATSI_BIRTHS_SUMM')
print(data_response.url)
DF = data_response.write(data_response.data.obs(with_values=True, with_attributes=True), parse_time=False)
The above gives an error: ValueError: Type names and field names cannot be a keyword: 'None'
DF = data_response.write(data_response.data.series, parse_time=False) works, but the dimension items come in column-wise.
Support Links:
http://stat.data.abs.gov.au/restsdmx/sdmx.ashx/GetDataStructure/all
http://stat.data.abs.gov.au/restsdmx/sdmx.ashx/GetDataStructure/ATSI_BIRTHS_SUMM
http://stat.data.abs.gov.au/Index.aspx?DataSetCode=ATSI_BIRTHS_SUMM
Please suggest a better way to retrieve the data.
Your example
DF = data_response.write(data_response.data.series, parse_time=False)
produces a stacked DataFrame; by calling unstack().reset_index() you will get a "flat" DataFrame:
data_response.write().unstack().reset_index()
   MEASURE INDIGENOUS_STATUS ASGS_2011 FREQUENCY TIME_PERIOD       0
0        1                IM         0         A        2001  8334.0
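Putting the two steps together, a minimal sketch based on the lines above:
DF = data_response.write(data_response.data.series, parse_time=False)
flat = DF.unstack().reset_index()
print(flat.head())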
Is this what you are looking for?
I am creating a python script using pandas to read through a file which has multiple row values.
Once read, I need to build an array of these values and then assign it to a dataframe row value.
The code I have used is:
import re
import numpy as np
import pandas as pd

master_data = pd.DataFrame()
temp_df = pd.DataFrame()
new_df = pd.DataFrame()

for f in data:
    ## Reading the file in pandas, which is in Excel format
    file_df = pd.read_excel(f)
    filename = file_df['Unnamed: 1'][2]
    ## Skipping first 24 rows to get the required reading values
    column_names = ['start_time', 'xxx_value']
    data_df = pd.read_excel(f, names=column_names, skiprows=25)
    array = np.array([])
    for i in data_df.iterrows():
        array = np.append(array, i[1][1])
    temp_df['xxx_value'] = [array]
    temp_df['Filename'] = filename
    temp_df['sub_id'] = temp_df['Filename'].str.split('_', 1).str[1].str.strip()
    temp_df['sen_site'] = temp_df['Filename'].str.split('_', 1).str[0].str.strip()
    temp_df['sampling_interval'] = 15
    temp_df['start_time'] = data_df['start_time'][2]
    new_df = new_df.append(xxx_df)

new_df.index = new_df.index + 1
new_df = new_df.sort_index()
new_df.index.name = 'record_id'
new_df = new_df.drop("Filename", 1)  ## dropping the Filename as it is not needed to be loaded in postgresql

## Rearrange to postgresql format
column_new_df = new_df.columns.tolist()
column_new_df.insert(4, column_new_df.pop(column_new_df.index('xxx_value')))
new_df = new_df.reindex(columns=column_new_df)
print(new_df)
This code is not working when I try to insert the array data into Postgresql.
It gives me an error stating:
ProgrammingError: (psycopg2.ProgrammingError) can't adapt type 'numpy.ndarray'
I am not sure where the problem is, as I can't see in your code the part where you insert the data into Postgres.
My guess, though, is that you are giving Postgres a NumPy array: psycopg2 can't handle NumPy data types, but it should be fairly easy to convert them to native Python types that work with psycopg2 (e.g. by using the .tolist() method; it is difficult to give more precise information without the code).
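A minimal sketch of that idea, assuming a psycopg2 connection conn and a table readings with an array column xxx_value (the table and column names here are made up):
values = new_df['xxx_value'].iloc[0].tolist()  # numpy array -> plain Python list
with conn.cursor() as cur:
    # psycopg2 adapts a plain Python list to a PostgreSQL ARRAY
    cur.execute("INSERT INTO readings (xxx_value) VALUES (%s)", (values,))
conn.commit()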
In my opinion, the most effective way would be to make psycopg2 aware of np.ndarray(s) across the board. One can do that by registering an adapter:
import numpy as np
from psycopg2.extensions import register_adapter, AsIs
def addapt_numpy_array(numpy_array):
    return AsIs(tuple(numpy_array))

register_adapter(np.ndarray, addapt_numpy_array)
To help working with numpy in general, my default addon to scripts/libraries dependent on psycopg2 is:
import numpy as np
from psycopg2.extensions import register_adapter, AsIs
def addapt_numpy_float64(numpy_float64):
    return AsIs(numpy_float64)

def addapt_numpy_int64(numpy_int64):
    return AsIs(numpy_int64)

def addapt_numpy_float32(numpy_float32):
    return AsIs(numpy_float32)

def addapt_numpy_int32(numpy_int32):
    return AsIs(numpy_int32)

def addapt_numpy_array(numpy_array):
    return AsIs(tuple(numpy_array))

register_adapter(np.float64, addapt_numpy_float64)
register_adapter(np.int64, addapt_numpy_int64)
register_adapter(np.float32, addapt_numpy_float32)
register_adapter(np.int32, addapt_numpy_int32)
register_adapter(np.ndarray, addapt_numpy_array)
Otherwise there would be some issues even with numerical types.
I got the adapter trick from this other stackoverflow entry.
Convert each numpy array element to its equivalent list using apply and tolist first, and then you should be able to write the data to Postgres:
df['column_name'] = df['column_name'].apply(lambda x: x.tolist())
We can address the issue by extracting one element at a time. Assuming a dataframe temp_df with a sub_id column of type numpy.int64, we can extract the value directly using iloc and item(), as temp_df.iloc[0]['sub_id'].item(), and push that into the DB.
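As a short illustration (the column name follows the answer above; everything else is hypothetical):
sub_id = temp_df.iloc[0]['sub_id'].item()  # numpy.int64 -> native Python int
print(type(sub_id))  # <class 'int'>, which psycopg2 can adapt directly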