I am going through a tutorial that uses sklearn.utils.Bunch as a data set:
cal_housing = fetch_california_housing()
I'm running this on a Databricks notebook.
I've read through the documentation I can find, like
https://scikit-learn.org/stable/modules/generated/sklearn.utils.Bunch.html, and search engines aren't yielding anything useful.
But how can I see/view what's in this data set?
If I understood correctly, you can convert it to a pandas DataFrame:
from sklearn.datasets import fetch_california_housing
import pandas as pd

df = fetch_california_housing()
calf_hous_df = pd.DataFrame(data=df.data, columns=df.feature_names)
calf_hous_df.sample(4)
Moreover, you can see attributes:
df.keys()
dict_keys(['data', 'target', 'feature_names', 'DESCR'])
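Since the keys also include target and DESCR, you can inspect those as well; a minimal sketch (the Target column name here is just an illustrative choice):
# Attach the regression target as an extra column (the name is arbitrary)
calf_hous_df['Target'] = df.target
# Print the human-readable dataset description stored in the Bunch
print(df.DESCR)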
The sklearn.utils.Bunch data can also be viewed by using pandas to turn it into a DataFrame:
data = pd.DataFrame(cal_housing.data,columns=cal_housing.feature_names)
data
I'm trying to stack my dataset using pandas and set the countries as the index.
import pandas as pd
url = 'https://raw.githubusercontent.com/cleibowitz/Module-6/main/Module%206%20Dataset%20-%20GDP%20TRANSPOSED.csv'
data = pd.read_csv(url, index_col = 'Year')
data.columns.name = 'Country'
data = pd.DataFrame(data.stack().rename('value'))
data.reset_index()
data = data.query('Year == 2020')
data.set_index('Country')
data
For some reason, I keep getting an error saying it can't find "Country", yet I know it is in the dataset. I'm looking for the 2020 values with the countries as the index.
Would someone mind helping me with this? Thanks!
You must reset the index first (because after stacking, the index is the combination of year and country):
data.reset_index().set_index('Country')
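Putting it together with the code from the question, a minimal sketch of the full pipeline (same URL and column names as above):
import pandas as pd

url = 'https://raw.githubusercontent.com/cleibowitz/Module-6/main/Module%206%20Dataset%20-%20GDP%20TRANSPOSED.csv'
data = pd.read_csv(url, index_col='Year')
data.columns.name = 'Country'
# Stack into long format: the index becomes (Year, Country)
data = pd.DataFrame(data.stack().rename('value'))
# Filter on the Year index level, then move Country out of the index
data = data.query('Year == 2020')
data = data.reset_index().set_index('Country')
data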
I keep getting the following error on Databricks:
SparkPandasNotImplementedError: .iloc requires numeric slice or conditional boolean Index, got You are trying to use pandas function .iloc[..., ...], use spark function select, where
This is my code:
import re
import nltk
import heapq
corpus = []
for i in range(0, len(Y)):
    describe = re.sub('[^a-zA-Z]', ' ', Y.iloc[i, 0])
    describe = describe.lower()
    describe = describe.split()
    describe = ' '.join(describe)
    corpus.append(describe)
The code works fine in Spyder, but not in Databricks.
I was able to reproduce your issue with the code below.
import numpy as np
import pandas as pd
import databricks.koalas as ks
dates = pd.date_range('20130101', periods=6)
pdf = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
df = ks.from_pandas(pdf)
print(pdf.iloc[0,0])
print(df.iloc[0,0])
Since you did not describe your variable Y, I guess Y is a DataFrame; the difference is that it is a pandas DataFrame locally in Spyder, but a Koalas DataFrame in Databricks.
According to the Koalas documentation for databricks.koalas.DataFrame.iloc, it does not support the operation iloc[int, int] on a Koalas DataFrame.
So if you want to operate on the first column value of each row in Databricks, there are two solutions, as below.
Solution 1: make sure Y is a pandas DataFrame within your Databricks script, as in the sketch below.
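A minimal sketch of this, assuming Y starts out as a Koalas DataFrame small enough to collect to the driver:
# Assumption: `Y` is a Koalas DataFrame that fits in driver memory
# Convert to pandas so that Y.iloc[i, 0] works as it did in Spyder
Y = Y.to_pandas()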
Solution 2: if Y must stay a Koalas DataFrame, iterate over its rows instead of indexing with iloc; please try the code below.
# Here, `Y` is a Koalas dataframe
for row in Y.iterrows():
    describe = re.sub('[^a-zA-Z]', ' ', row[1][0])
    describe = describe.lower()
    describe = describe.split()
    describe = ' '.join(describe)
    corpus.append(describe)
As the sample code above shows, the function iterrows can help to get the first column value of each row.
My code works perfectly fine for one DataFrame using to_json.
However, now I would like to have a second DataFrame in the result.
So I thought creating a dictionary would be the answer.
However, it produces the result below, which is not practical.
Any help, please?
I was hoping to produce something a lot prettier without all the "\".
A simple good example
import pandas as pd
data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':[28,34,29,42]}
df = pd.DataFrame(data)
df.to_json(orient='records')
A simple bad example
import pandas as pd
data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':[28,34,29,42]}
df = pd.DataFrame(data)
{"result_1": df.to_json(orient='records')}
I also tried
jsonify({"result_1": df.to_json(orient='records')})
and
{"result_1": [df.to_json(orient='records')]}
Hi, I think you are on the right track.
My advice is to also use json.loads to decode the JSON and build a list of dictionaries.
As you said, we can create a pandas DataFrame and then use df.to_json to convert it.
Then use json.loads on the JSON-formatted data and create a dictionary to insert into a list, e.g.:
import json

data = {}
jsdf = df.to_json(orient="records")
data["result"] = json.loads(jsdf)  # real dicts, not an escaped string
Adding elements to the dictionary this way, you end up with a structure like this:
{"result1": [{...}], "result2": [{...}]}
PS:
If you want to generate random values for different DataFrames, you can use Python's faker library.
e.g.:
from faker import Faker

faker = Faker()
rows = []
for n in range(5):
    rows.append(list(faker.profile().values()))
df = pd.DataFrame(rows, columns=faker.profile().keys())
When using Python pandas to read a CSV, it is possible to specify the index column. Is this possible using Python Dask when reading the file, as opposed to setting the index afterwards?
For example, using pandas:
df = pandas.read_csv(filename, index_col=0)
Ideally, using Dask, this would be:
df = dask.dataframe.read_csv(filename, index_col=0)
I have tried
df = dask.dataframe.read_csv(filename).set_index(?)
but the index column does not have a name (and this seems slow).
No, these need to be two separate methods. If you try this then Dask will tell you in a nice error message.
In [1]: import dask.dataframe as dd
In [2]: df = dd.read_csv('*.csv', index='my-index')
ValueError: Keyword 'index' not supported
Use dd.read_csv(...).set_index('my-index') instead
But this won't be any slower or faster than doing it the other way.
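So the supported pattern is the two-step form the error message suggests; a minimal sketch (the file pattern and column name are placeholders):
import dask.dataframe as dd

# Read first, then set the index in a separate step
df = dd.read_csv('*.csv').set_index('my-index')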
I know I'm a bit late, but this is the first result on Google, so it should get answered.
If you write your dataframe with:
# index=True is the default
my_pandas_df.to_csv('path')
# so this is the same
my_pandas_df.to_csv('path', index=True)
And import with Dask:
import dask.dataframe as dd
my_dask_df = dd.read_csv('path').set_index('Unnamed: 0')
It will use column 0 as your index (which is unnamed, thanks to pandas.DataFrame.to_csv()).
How to figure it out:
my_dask_df = dd.read_csv('path')
my_dask_df.columns
which returns
Index(['Unnamed: 0', 'col 0', 'col 1',
...
'col n'],
dtype='object', length=...)
Now you can write: my_dask_df = dd.read_csv('path').set_index('column_name') (where column_name is the name of the column you want to set as the index).
I am running some tests on a very simple dataset which consists basically of numerical data.
It can be found here.
I was working with pandas, numpy and scikit-learn just fine but when moving to Spark I couldn't set up the data in the correct format to input it to a Decision Tree.
I was doing this, which didn't work:
df = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/poker/poker-hand-training-true.data')
raw_data = sc.parallelize(df)
train_dataset = raw_data.map(lambda line: line.split(","))\
    .map(lambda line: LabeledPoint(line[10], np.array([float(x) for x in line[0:10]])))
I kept getting IndexError: list index out of range when trying to access line inside the map function.
I only managed to get it to work when I actually downloaded the file and changed the code as follows:
raw_data = sc.textFile('.../datasets/poker-hand-training.data')
train_dataset = raw_data.map(lambda line: line.split(","))\
    .map(lambda line: LabeledPoint(line[10], np.array([float(x) for x in line[0:10]])))
If I don't want to download the dataset, is it possible to get the data ready directly from pandas dataframes using read_csv?
I would suggest first converting the pandas DataFrame into a Spark DataFrame; you can use the spark.createDataFrame method to do that. (sc.parallelize(df) fails because iterating over a pandas DataFrame yields its column labels, not its rows, which is why line[10] raised IndexError.)
import pandas as pd
from pyspark.mllib.regression import LabeledPoint

df = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/poker/poker-hand-training-true.data', names=['S1','C1','S2','C2','S3','C3','S4','C4','S5','C5','class'])
s_df = spark.createDataFrame(df)
Now you can use this DataFrame to get your training dataset.
train_dataset = s_df.rdd.map(lambda x: LabeledPoint(x[10], x[:10])).collect()
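From there, a minimal sketch of feeding the data into MLlib's RDD-based decision tree (the poker dataset has 10 classes; the hyperparameters are illustrative, and there is no need to collect() for distributed training):
from pyspark.mllib.tree import DecisionTree

# Train directly on the RDD of LabeledPoints
train_rdd = s_df.rdd.map(lambda x: LabeledPoint(x[10], x[:10]))
model = DecisionTree.trainClassifier(train_rdd, numClasses=10,
                                     categoricalFeaturesInfo={}, maxDepth=5)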