Convert from pandas dataframe to LabeledPoint RDD - python

I am running some tests on a very simple dataset which consists basically of numerical data.
It can be found here.
I was working with pandas, numpy and scikit-learn just fine, but when moving to Spark I couldn't set up the data in the correct format to feed it to a Decision Tree.
I was doing this, which didn't work:
df = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/poker/poker-hand-training-true.data')
raw_data = sc.parallelize(df)
train_dataset = raw_data.map(lambda line: line.split(","))\
.map(lambda line:LabeledPoint(line[10], np.array([float(x) for x in line[0:10]])))
I kept getting IndexError: list index out of range when trying to access line inside the map function.
I only managed to get it to work when I actually downloaded the file and changed the code as follows:
raw_data = sc.textFile('.../datasets/poker-hand-training.data')
train_dataset = raw_data.map(lambda line: line.split(","))\
.map(lambda line:LabeledPoint(line[10], np.array([float(x) for x in line[0:10]])))
If I don't want to download the dataset, is it possible to get the data ready directly from pandas dataframes using read_csv?

I would suggest first converting the Pandas DataFrame into a Spark DataFrame; sc.parallelize(df) iterates over the DataFrame's column labels rather than its rows, which is why the index lookup inside map fails. You can use the spark.createDataFrame method (or sqlContext.createDataFrame on older versions) to do that.
df = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/poker/poker-hand-training-true.data', names=['S1','C1','S2','C2','S3','C3','S4','C4','S5','C5','class'])
s_df = spark.createDataFrame(df)
Now you can use this DataFrame to get your training dataset.
train_dataset = s_df.rdd.map(lambda x: LabeledPoint(x[10], x[:10])).collect()
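If you prefer to end up with an RDD of LabeledPoint (as in the title) rather than a local list, here is a minimal sketch of the whole pipeline under the same assumptions as above (a running SparkSession named spark and MLlib's LabeledPoint); dropping .collect() is what keeps the result distributed:
import pandas as pd
from pyspark.mllib.regression import LabeledPoint

names = ['S1','C1','S2','C2','S3','C3','S4','C4','S5','C5','class']
# Read the CSV straight from the URL with pandas, then hand it to Spark
pdf = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/poker/poker-hand-training-true.data', names=names)
s_df = spark.createDataFrame(pdf)

# The last column is the label, the first ten columns are the features
train_dataset = s_df.rdd.map(lambda row: LabeledPoint(row[10], row[:10]))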

Related

Merging data from two dataframes for training

I have the following two dataframes:
data = pd.read_csv('/content/drive/My Drive/Colab Notebooks/digit-recognition/train_data.csv')
data_custom = pd.read_csv('/content/drive/My Drive/Colab Notebooks/digit-recognition/custom-data.csv',header=None)
I want to train my KNN model on the combined data. Is there a way to combine these two dataframes? A normal merge may not work
directly, as the column headers are present in one but not the other, although their structure is exactly the same.
custom-data.csv file
https://drive.google.com/file/d/1Qj-zfWoaYbMMbEin1K0dFbFHfDFr_t85/view?usp=sharing
train_data.csv file
https://drive.google.com/file/d/1yDmKBt-boMfaF5LK2SN7MM8LeUZ1p6vD/view?usp=sharing
final_data = pd.concat([data, data_custom]) produces the output shown in a screenshot attached to the original post.
Screenshots of the custom-data.csv file and of the top rows of train_data.csv were also attached.
You could try pd.concat:
data.columns = data_custom.columns
final_data = pd.concat([data, data_custom])
You could definitely use concat, as said by #U12-Forward.
Otherwise, take a look at this page, which shows the differences between concat, merge, join and append, depending on what you need to do.
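For instance, a minimal sketch of the concat route, assuming the two files really do share the same column layout and only custom-data.csv lacks a header row:
import pandas as pd

data = pd.read_csv('train_data.csv')
data_custom = pd.read_csv('custom-data.csv', header=None)

# Give the header-less frame the same column labels, then stack the rows
data_custom.columns = data.columns
final_data = pd.concat([data, data_custom], ignore_index=True)  # ignore_index resets the row labels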
import pandas as pd
data = pd.read_csv("train_data.csv")
data_custom = pd.read_csv('custom-data.csv', header=None)
# combining data_custom df as row to the end of data df:
entry = data_custom.iloc[0].values
data.loc[len(data)] = entry
data shape at start: (42000, 785)
data shape after combining: (42001, 785)
The custom data is combined as another row.
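Once the rows are combined (whether into final_data via concat or into data via the row append above), fitting the KNN model is the straightforward part. A minimal scikit-learn sketch, assuming the combined frame is called final_data, the label lives in a column named 'label' and the remaining 784 columns are pixel values (adjust the column name to whatever train_data.csv actually uses):
from sklearn.neighbors import KNeighborsClassifier

X = final_data.drop(columns=['label']).values  # 'label' is an assumed column name
y = final_data['label'].values

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)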

How to sort dataframe by value in Pandas?

I have a data set in CSV that I read with pd.read_csv. I want to sort the existing data by descending value.
my code is this:
dataset = pd.read_csv('Kripto.csv')
sorted = dataset.sort_values(by = "Freq1", ascending=False)
x = dataset.iloc[:, :].values
and my data set (print(dataset)) is this :
Letter;Freq1
0 A;0.0817
1 B;0.0150
2 C;0.0278
3 D;0.0425
4 E;0.1270
When I want to use this code:
sorted = dataset.sort_values(by = "Freq1", ascending=False)
Python gives me an error and says KeyError: 'Freq1'
I know that "Freq1" is not the name of the column, but I have no idea how to assign a name.
Your CSV file has ";" as the separator; you need to indicate that in the read_csv call:
import pandas as pd
dataset = pd.read_csv('your.csv', sep=';')
And that's all you need to do
Your CSV file uses semicolons to separate values. Since Pandas by default expects commas, use
dataset = pd.read_csv('Kripto.csv', sep=';')
instead.
You should also use the sorted dataset to get your values in sorted order, instead of dataset, since the latter will remain unsorted:
x = sorted.iloc[:, :].values
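Putting both answers together, a minimal sketch of the whole fix (a different variable name is used here because sorted shadows Python's built-in of the same name):
import pandas as pd

dataset = pd.read_csv('Kripto.csv', sep=';')  # the file is semicolon-separated
sorted_dataset = dataset.sort_values(by='Freq1', ascending=False)
x = sorted_dataset.iloc[:, :].values          # values in descending order of Freq1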

how to display/view `sklearn.utils.Bunch` data set?

I am going through a tutorial that uses sklearn.utils.Bunch as a data set:
cal_housing = fetch_california_housing()
I'm running this on a Databricks notebook.
I've read through the documentation I could find, like
https://scikit-learn.org/stable/modules/generated/sklearn.utils.Bunch.html, and search engines aren't yielding anything useful.
But how can I see/view what's in this data set?
If I understood correctly, you can convert it to a pandas dataframe:
from sklearn.datasets import california_housing
import pandas as pd

df = california_housing.fetch_california_housing()
calf_hous_df = pd.DataFrame(data=df.data, columns=df.feature_names)
calf_hous_df.sample(4)
Moreover, you can see attributes:
df.keys()
dict_keys(['data', 'target', 'feature_names', 'DESCR'])
The sklearn.utils.Bunch data can be viewed by using pandas to make it into a dataframe:
data = pd.DataFrame(cal_housing.data,columns=cal_housing.feature_names)
data
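If you also want the target values and the dataset description alongside the features, a small sketch building on the keys() output shown above:
import pandas as pd
from sklearn.datasets import fetch_california_housing

cal_housing = fetch_california_housing()
data = pd.DataFrame(cal_housing.data, columns=cal_housing.feature_names)
data['target'] = cal_housing.target   # add the regression target as a column
print(cal_housing.DESCR)              # free-text description of the dataset
data.head()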

SparkPandasNotImplementedError: .iloc requires numeric slice or conditional boolean Index

I keep getting the following error on Databricks:
SparkPandasNotImplementedError: .iloc requires numeric slice or conditional boolean Index, got You are trying to use pandas function .iloc[..., ...], use spark function select, where
this is my code:
import re
import nltk
import heapq

corpus = []
for i in range(0, len(Y)):
    describe = re.sub('[^a-zA-Z]', ' ', Y.iloc[i, 0])
    describe = describe.lower()
    describe = describe.split()
    describe = ' '.join(describe)
    corpus.append(describe)
The code works fine in Spyder, but not in Databricks.
I successfully reproduced your issue with the code and figure below.
import numpy as np
import pandas as pd
import databricks.koalas as ks
dates = pd.date_range('20130101', periods=6)
pdf = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
df = ks.from_pandas(pdf)
print(pdf.iloc[0,0])
print(df.iloc[0,0])
Since there is no description of your variable Y, I guess Y is a dataframe, but the difference is that it is a pandas dataframe on local Spyder and a Koalas dataframe in Databricks.
According to the Koalas document for databricks.koalas.DataFrame.iloc, it does not support the operation iloc(int, int) for a Koalas dataframe.
So if you want to do some operation on the first column value of each row in Databricks, there are two solutions:
1. Make sure Y is a pandas dataframe within the same Databricks script.
2. If Y must be a Koalas dataframe, try the code below.
# Here, `Y` is a Koalas dataframe
for row in Y.iterrows():
    describe = re.sub('[^a-zA-Z]', ' ', row[1][0])
    describe = describe.lower()
    describe = describe.split()
    describe = ' '.join(describe)
    corpus.append(describe)
As you can see from my sample code and result below, the iterrows function can be used to get the first column value of each row.
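Another option, if Y is small enough to fit on the driver, is to convert the Koalas dataframe back to pandas before the loop; this is only a sketch of that idea:
import re

# Pull the Koalas dataframe down to a local pandas dataframe, then index it as usual
Y_pd = Y.to_pandas()

corpus = []
for i in range(len(Y_pd)):
    describe = re.sub('[^a-zA-Z]', ' ', Y_pd.iloc[i, 0])
    describe = ' '.join(describe.lower().split())
    corpus.append(describe)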

How to transform vector into array for Frequent Pattern Analysis

I am applying a frequent pattern analysis and need some help with the input type.
To start with, I use StringIndexer to transform my categorical variables into numbers.
Afterwards, I create a unique number for each categorical value like this:
from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType

add_100 = udf(lambda x: x + 100, returnType=FloatType())
add_1000 = udf(lambda x: x + 1000, returnType=FloatType())
df = df.select('cat_var_1', add_100('cat_var_2').alias('cat_var_2_final'), add_1000('cat_var_3').alias('cat_var_3_final'))
My next step is to create a vector with the features:
featuresCreator = ft.VectorAssembler(inputCols=[col for col in features], outputCol='features')
df=featuresCreator.transform(df)
Lastly, I try to fit my model:
from pyspark.ml.fpm import FPGrowth
fpGrowth = FPGrowth(itemsCol="features", minSupport=0.5, minConfidence=0.6)
model = fpGrowth.fit(df)
And get this error:
u'requirement failed: The input column must be ArrayType, but got
org.apache.spark.ml.linalg.VectorUDT#3bfc3ba7.
So, the question is, how can I transform my vector into array? Or, are there other ways for me to solve this issue?
FPGrowth takes an array instead of a vector. Since VectorAssembler gives you a vector as output, a possible and simple solution is to convert that output to an array using a UDF.
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, DoubleType

# .tolist() turns the numpy array returned by toArray() into a plain Python list
to_array = udf(lambda x: x.toArray().tolist(), ArrayType(DoubleType()))
df = df.withColumn('features', to_array('features'))
A better solution is to do everything at once, i.e. not use a VectorAssembler at all. This has the benefit of not needing a UDF and is thus much faster; it makes use of the array function built into pyspark.
from pyspark.sql import functions as F
df2 = df.withColumn('features', F.array('cat_var_1', 'cat_var_2', 'cat_var_3'))
I think you don't need a udf for creating the unique numbers. Alternatively, you can use withColumn directly like:
df = df.withColumn('cat_var_2_final',df['cat_var_2']+100).withColumn('cat_var_3_final',df['cat_var_3']+1000)
Also, if you are using this data only for the FPGrowth model, you can skip the VectorAssembler and directly create the array feature using a udf as:
udf1 = udf(lambda c1,c2,c3 : (c1,c2,c3),ArrayType(IntegerType()))
df = df.withColumn('features',udf1(df['cat_var_1'],df['cat_var_2_final'],df['cat_var_3_final']))
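Putting these suggestions together, a minimal end-to-end sketch (assuming cat_var_1, cat_var_2 and cat_var_3 are already numeric indexes produced by StringIndexer; the offsets only keep item ids from the three columns from colliding):
from pyspark.sql import functions as F
from pyspark.ml.fpm import FPGrowth

# Offset the columns so identical indexes from different columns become distinct items,
# then build the items array directly: no VectorAssembler and no UDF needed
items_df = (df
    .withColumn('cat_var_2_final', F.col('cat_var_2') + 100)
    .withColumn('cat_var_3_final', F.col('cat_var_3') + 1000)
    .withColumn('features', F.array('cat_var_1', 'cat_var_2_final', 'cat_var_3_final')))

fpGrowth = FPGrowth(itemsCol='features', minSupport=0.5, minConfidence=0.6)
model = fpGrowth.fit(items_df)
model.freqItemsets.show()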
