How to display/view a `sklearn.utils.Bunch` data set in Python?

I am going through a tutorial that uses sklearn.utils.Bunch as a data set:
cal_housing = fetch_california_housing()
I'm running this on a Databricks notebook.
I've read through what documentation I can find, like
https://scikit-learn.org/stable/modules/generated/sklearn.utils.Bunch.html, and search engines aren't yielding anything useful.
How can I see/view what's in this data set?

If I understood correctly, you can convert it to a pandas DataFrame:
import pandas as pd
from sklearn.datasets import california_housing

df = california_housing.fetch_california_housing()
calf_hous_df = pd.DataFrame(data=df.data, columns=df.feature_names)
calf_hous_df.sample(4)
Moreover, you can see attributes:
df.keys()
dict_keys(['data', 'target', 'feature_names', 'DESCR'])

The sklearn.utils.Bunch data can be viewed by using pandas to turn it into a DataFrame:
data = pd.DataFrame(cal_housing.data,columns=cal_housing.feature_names)
data
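If you also want the target values and the dataset description (both listed among the keys above), a minimal sketch building on the same Bunch (the column name 'target' is just a label chosen here):
data['target'] = cal_housing.target  # append the target as an extra column
print(cal_housing.DESCR)             # full textual description of the dataset
data.head()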

Related

Issue with stacking in Pandas

I'm trying to stack my dataset using pandas and set the countries as the index.
import pandas as pd
url = 'https://raw.githubusercontent.com/cleibowitz/Module-6/main/Module%206%20Dataset%20-%20GDP%20TRANSPOSED.csv'
data = pd.read_csv(url, index_col = 'Year')
data.columns.name = 'Country'
data = pd.DataFrame(data.stack().rename('value'))
data.reset_index()
data = data.query('Year == 2020')
data.set_index('Country')
data
For some reason, I keep getting this error that it can't find "Country", yet I know it is in the dataset. I'm looking for this output:
Would someone mind helping me with this? Thanks!
You must reset your index first (because the current index is the combination of Year and Country):
data.reset_index().set_index('Country')
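For completeness, a sketch of the corrected sequence with the intermediate results assigned back (using the same URL as in the question):
import pandas as pd

url = 'https://raw.githubusercontent.com/cleibowitz/Module-6/main/Module%206%20Dataset%20-%20GDP%20TRANSPOSED.csv'
data = pd.read_csv(url, index_col='Year')
data.columns.name = 'Country'

# Stack into long format: MultiIndex of (Year, Country), one 'value' column
data = pd.DataFrame(data.stack().rename('value'))

# Keep 2020, then move Country out of the MultiIndex and make it the index
data = data.query('Year == 2020')
data = data.reset_index().set_index('Country')
data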

SparkPandasNotImplementedError: .iloc requires numeric slice or conditional boolean Index

I keep getting the following error on Databricks:
SparkPandasNotImplementedError: .iloc requires numeric slice or conditional boolean Index, got You are trying to use pandas function .iloc[..., ...], use spark function select, where
this is my code:
import re
import nltk
import heapq
corpus = []
for i in range(0, len(Y)):
    describe = re.sub('[^a-zA-Z]', ' ', Y.iloc[i, 0])
    describe = describe.lower()
    describe = describe.split()
    describe = ' '.join(describe)
    corpus.append(describe)
The code works fine in Spyder, but not in Databricks.
I was able to reproduce the same issue with the code below.
import numpy as np
import pandas as pd
import databricks.koalas as ks
dates = pd.date_range('20130101', periods=6)
pdf = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
df = ks.from_pandas(pdf)
print(pdf.iloc[0,0])
print(df.iloc[0,0])
Since there is no description of your variable Y, I'm guessing Y is a dataframe; the difference is that it is a pandas dataframe locally in Spyder, but a Koalas dataframe in Databricks.
According to the Koalas documentation for databricks.koalas.DataFrame.iloc, the operation iloc[int, int] is not supported on a Koalas dataframe.
So if you want to operate on the first column value of each row in Databricks, there are two solutions:
1. Make sure Y is a pandas dataframe in your Databricks script (a conversion sketch follows after the loop below).
2. If Y must stay a Koalas dataframe, try the code below.
# Here, `Y` is a Koalas dataframe
for row in Y.iterrows():
    describe = re.sub('[^a-zA-Z]', ' ', row[1][0])
    describe = describe.lower()
    describe = describe.split()
    describe = ' '.join(describe)
    corpus.append(describe)
As the sample code shows, the iterrows function can be used to get the first column value of each row.
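For the first option, a sketch of one way to do it, assuming Y is a Koalas DataFrame small enough to pull back to the driver with to_pandas():
import re

# Assumption: `Y` is a Koalas DataFrame that fits in driver memory
Y_pd = Y.to_pandas()

corpus = []
for i in range(len(Y_pd)):
    describe = re.sub('[^a-zA-Z]', ' ', Y_pd.iloc[i, 0])
    describe = ' '.join(describe.lower().split())
    corpus.append(describe)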

Creating JSON from multiple dataframes python

My code works perfectly fine for one dataframe using to_json.
However, I would now like to have a second dataframe in the result.
So I thought creating a dictionary would be the answer.
However, it produces the result below, which is not practical.
Any help, please?
I was hoping to produce something a lot prettier without all the "\".
A simple good example
import pandas as pd
data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':[28,34,29,42]}
df = pd.DataFrame(data)
df.to_json(orient='records')
A simple bad example
import pandas as pd
data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':[28,34,29,42]}
df = pd.DataFrame(data)
{"result_1": df.to_json(orient='records')}
I also tried
jsonify({"result_1": df.to_json(orient='records')})
and
{"result_1": [df.to_json(orient='records')]}
Hi, I think you are on the right track.
My advice is to also use json.loads to decode the JSON and create a list of dictionaries.
As you said, we can create a pandas dataframe and then use df.to_json to convert it.
Then use json.loads on the JSON string and store the result under a key in a dictionary, e.g.:
import json

data = {}
jsdf = df.to_json(orient="records")
data["result"] = json.loads(jsdf)
Adding more elements to the dictionary in the same way, you end up with something like this:
{"result1": [{...}], "result2": [{...}]}
PS:
If you want to generate random values for the different dataframes, you can use the faker library for Python.
e.g.:
import pandas as pd
from faker import Faker

faker = Faker()
rows = [list(faker.profile().values()) for _ in range(5)]
df = pd.DataFrame(rows, columns=faker.profile().keys())

Can I set the index column when reading a CSV using Python dask?

When using Python Pandas to read a CSV it is possible to specify the index column. Is this possible using Python Dask when reading the file, as opposed to setting the index afterwards?
For example, using pandas:
df = pandas.read_csv(filename, index_col=0)
Ideally, using Dask, this would be:
df = dask.dataframe.read_csv(filename, index_col=0)
I have tried
df = dask.dataframe.read_csv(filename).set_index(?)
but the index column does not have a name (and this seems slow).
No, these need to be two separate methods. If you try this then Dask will tell you in a nice error message.
In [1]: import dask.dataframe as dd
In [2]: df = dd.read_csv('*.csv', index='my-index')
ValueError: Keyword 'index' not supported dd.read_csv(...).set_index('my-index') instead
But this won't be any slower or faster than doing it the other way.
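So the supported pattern is the two-step version the error message points to, e.g. (where 'my-index' is just a placeholder column name):
import dask.dataframe as dd

# Read first, then set the index in a separate step
df = dd.read_csv('*.csv').set_index('my-index')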
I know I'm a bit late, but this is the first result on google so it should get answered.
If you write your dataframe with:
# index=True is the default
my_pandas_df.to_csv('path')
# so this is the same
my_pandas_df.to_csv('path', index=True)
And import with Dask:
import dask.dataframe as dd
my_dask_df = dd.read_csv('path').set_index('Unnamed: 0')
It will use column 0 as your index (which is unnamed thanks to pandas.DataFrame.to_csv()).
How to figure it out:
my_dask_df = dd.read_csv('path')
my_dask_df.columns
which returns
Index(['Unnamed: 0', 'col 0', 'col 1',
...
'col n'],
dtype='object', length=...)
Now that you know the column name, you can write: my_dask_df = dd.read_csv('path').set_index('column_name') (where column_name is the name of the column you want to set as the index).
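If you control the writing side, a sketch of a way to avoid the 'Unnamed: 0' column altogether is to name the index before saving (the name 'id' here is just an example):
import pandas as pd
import dask.dataframe as dd

my_pandas_df = pd.DataFrame({'a': [1, 2, 3]})
my_pandas_df.index.name = 'id'  # give the index a real name before writing
my_pandas_df.to_csv('path.csv')

my_dask_df = dd.read_csv('path.csv').set_index('id')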

Convert from pandas dataframe to LabeledPoint RDD

I am running some tests on a very simple dataset which consists basically of numerical data.
It can be found here.
I was working with pandas, numpy and scikit-learn just fine but when moving to Spark I couldn't set up the data in the correct format to input it to a Decision Tree.
I was doing this, which didn't work:
df = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/poker/poker-hand-training-true.data')
raw_data = sc.parallelize(df)
train_dataset = raw_data.map(lambda line: line.split(",")) \
    .map(lambda line: LabeledPoint(line[10], np.array([float(x) for x in line[0:10]])))
I kept getting IndexError: list index out of range when trying to access line inside the map function.
I only managed to get it to work when I actually downloaded the file and changed the code as follows:
raw_data = sc.textFile('.../datasets/poker-hand-training.data')
train_dataset = raw_data.map(lambda line: line.split(",")) \
    .map(lambda line: LabeledPoint(line[10], np.array([float(x) for x in line[0:10]])))
If I don't want to download the dataset, is it possible to get the data ready directly from pandas dataframes using read_csv?
I would suggest you first convert the pandas DataFrame into a Spark DataFrame. You can use the spark.createDataFrame (or sqlContext.createDataFrame) method to do that.
df = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/poker/poker-hand-training-true.data', names=['S1','C1','S2','C2','S3','C3','S4','C4','S5','C5','class'])
s_df = spark.createDataFrame(df)
Now you can use this DataFrame to get your training dataset.
train_dataset = s_df.rdd.map(lambda x: LabeledPoint(x[10], x[:10])).collect()
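If the eventual goal is to train a decision tree, one possible follow-up sketch (the parameters here are assumptions; the poker-hand data has 10 classes) keeps the result as an RDD rather than collecting it to a list:
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.tree import DecisionTree

# Keep an RDD of LabeledPoint instead of collecting to the driver
train_rdd = s_df.rdd.map(lambda x: LabeledPoint(x[10], x[:10]))

model = DecisionTree.trainClassifier(train_rdd, numClasses=10,
                                     categoricalFeaturesInfo={})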
