I have a very simple csv file that looks like this:
time,is_boy,is_girl
135,1,0
136,0,1
137,0,1
I have this csv file sitting in a Hive table also, where all the values have been created as doubles in the table.
Behind the scenes, this table is actually enormous, with a huge number of rows, so I have chosen to use Spark 2 to solve this problem.
I would like to use this clustering library, with Python:
https://spark.apache.org/docs/2.2.0/ml-clustering.html
If anyone knows how to load this data, either directly from the csv or with some Spark SQL magic, and preprocess it correctly in Python so that it can be passed into the KMeans fit() method to compute a model, I would be very grateful. I also think it would be useful for others, as I haven't yet found an example covering csvs and this library.
The fit() method just takes a DataFrame with a vector column.
spark.read.csv and spark.sql both return a DataFrame.
However you want to preprocess your data, read over the DataFrame documentation before getting into the MLlib / KMeans examples.
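As a minimal sketch (assuming a SparkSession called spark already exists and the csv sits at an illustrative path like /user/me/experiment.csv), loading the file directly and assembling the feature vector could look like this:
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans
# read the csv with its header row and let Spark infer numeric types
df = spark.read.csv("/user/me/experiment.csv", header=True, inferSchema=True)
# pack every column into a single "features" vector column, which is what KMeans expects
assembler = VectorAssembler(inputCols=df.columns, outputCol="features")
vdf = assembler.transform(df)
model = KMeans(k=2, seed=1).fit(vdf)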
So I guessed enough times and finally solved this. There were quite a few weird things I had to do to get it to work, so I feel it's worth sharing:
I created a simple csv like so:
time,is_boy,is_girl
123,1.0,0.0
132,1.0,0.0
135,0.0,1.0
139,0.0,1.0
140,1.0,0.0
Then I created a hive table, executing this query in hue:
CREATE EXTERNAL TABLE pollab02.experiment_raw(
`time` double,
`is_boy` double,
`is_girl` double)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' with
serdeproperties( 'separatorChar' = ',' )
STORED AS TEXTFILE LOCATION "/user/me/hive/experiment"
TBLPROPERTIES ("skip.header.line.count"="1", "skip.footer.line.count"="0")
Then my pyspark script was as follows:
(I'm assuming a SparkSession has been created with the name "spark")
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans
raw_data = spark.sql("select * from dbname.experiment_raw")
#filter out row of null values that were added for some reason
raw_data_filtered=raw_data.filter(raw_data.time>-1)
#convert rows of strings to doubles for kmeans:
data=raw_data_filtered.select([col(c).cast("double") for c in raw_data_filtered.columns])
cols = data.columns
#Assemble all columns into a single "features" column that holds each row's data as one vector
vectorAss = VectorAssembler(inputCols=cols, outputCol="features")
vdf=vectorAss.transform(data)
kmeans = KMeans(k=2, maxIter=10, seed=1)
model = kmeans.fit(vdf)
and the rest is history. I haven't followed best practices here. We could maybe drop some columns that we don't need from the vdf DataFrame to save space and improve performance, but this works.
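For completeness, a quick sketch of inspecting the fitted model (these are standard KMeansModel calls, applied to the vdf from above):
#cluster centres as plain arrays
print(model.clusterCenters())
#attach a "prediction" column with each row's cluster assignment
predictions = model.transform(vdf)
predictions.show(5)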
Use case is to append a column to a Parquet dataset and then re-write efficiently at the same location. Here is a minimal example.
Create a pandas DataFrame and write as a partitioned Parquet dataset.
import pandas as pd
df = pd.DataFrame({
'id': ['a','a','a','b','b','b','b','c','c'],
'value': [0,1,2,3,4,5,6,7,8]})
path = r'c:/data.parquet'
df.to_parquet(path=path, engine='pyarrow', compression='snappy', index=False, partition_cols=['id'], flavor='spark')
Then load the Parquet dataset as a pyspark view and create a modified dataset as a pyspark DataFrame.
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
spark.read.parquet(path).createTempView('data')
sf = spark.sql(f"""SELECT id, value, 0 AS segment FROM data""")
At this point the sf data is the same as the df data but with an additional segment column of all zeros. I would like to efficiently overwrite the existing Parquet dataset at path with sf as a Parquet dataset in the same location. Below is what does not work. I would also prefer not to write sf to a new location, delete the old Parquet dataset and rename it, as that does not seem efficient.
# saves existing data and new data
sf.write.partitionBy('id').mode('append').parquet(path)
# immediately deletes existing data then crashes
sf.write.partitionBy('id').mode('overwrite').parquet(path)
My answer in short: you shouldn't :\
One principle of big data (and Spark is for big data) is to never overwrite stuff. Sure, .mode('overwrite') exists, but it is not the correct usage here.
My guesses as to why it could (should) fail:
you add a column, so the written dataset has a different format than the one currently stored there. This can create schema confusion
you overwrite the input data while processing. So Spark reads some lines, processes them and overwrites the input files. But those files are still the inputs for other lines that remain to be processed.
What I usually do in such a situation is create another dataset, and when there is no reason to keep the old one (i.e. when the processing is completely finished), clean it up. To remove files, you can check this post on how to delete hdfs files. It should work for all files accessible by Spark. However it is in Scala, so I'm not sure whether it can be adapted to pyspark.
Note that efficiency is not a good reason to overwrite; it does more work than simply writing.
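A minimal sketch of that write-then-clean approach (the new path is a placeholder, and the delete goes through the JVM gateway, which relies on private SparkSession attributes, so treat it as an assumption-laden adaptation of the Scala post rather than an official API):
new_path = path + '_v2'  # placeholder for the new location
sf.write.partitionBy('id').mode('overwrite').parquet(new_path)
# once nothing downstream needs the old copy any more, delete it recursively
jvm_path = spark._jvm.org.apache.hadoop.fs.Path(path)
fs = jvm_path.getFileSystem(spark._jsc.hadoopConfiguration())
fs.delete(jvm_path, True)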
This is another follow-up to an earlier question I posted: How can I merge these many csv files (around 130,000) using PySpark into one large dataset efficiently?
I have the following dataset https://fred.stlouisfed.org/categories/32263/downloaddata/INTRNTL_csv_2.zip
In it there's a list of files (around 130,000) in the main directory, with their sub-directories listed, so the first cell might be A/AAAAA and the corresponding file would be located at /data/A/AAAAA.csv
The files all have a similar format: the first column is called DATE and the second column is a series that is always named VALUE. So first of all, the VALUE column needs to be renamed to the file name in each csv file. Second, the frames need to be full outer joined with each other with DATE as the main index. Third, I want to save the result and be able to load and manipulate it. The file should be roughly N rows (number of dates) x 130,001 columns.
I am trying to full outer join all the files into a single dataframe, I previously tried pandas but ran out of memory when trying to concat the list of files and someone recommended that I try to use PySpark instead.
In a previous post I was told that I could do this:
df = spark.read.csv("/kaggle/input/bf-csv-2/BF_csv_2/data/**/*.csv", "date DATE, value DOUBLE")
But all the columns are named VALUE and the frame just becomes two columns: the first column is DATE and the second column is VALUE. It loads quite fast, around 38 seconds, and is around 3.8 million values by 2 columns, so I know it's not doing the full outer join; it's appending the files row-wise.
So I tried the following code:
import pandas as pd
import time
import os
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('spark-dataframe-demo').getOrCreate()
from pyspark.sql import *
from pyspark.sql.functions import col
from pyspark.sql import DataFrame
from pyspark.sql.types import *
filelist = pd.read_excel("/kaggle/input/list/BF_csv_2.xlsx") #list of filenames
firstname = min(filelist.File)
length = len(filelist.File)
dff = spark.read.csv(f"/kaggle/input/bf-csv-2/BF_csv_2/data/" + firstname, inferSchema = True, header = True).withColumnRenamed("VALUE",firstname) #read file and changes name of column to filename
for row in filelist.File.items():
    if row[1] == firstname:
        continue
    print(row[1], length, end='', flush=True)
    df = spark.read.csv("/kaggle/input/bf-csv-2/BF_csv_2/data/" + row[1], inferSchema=True, header=True).withColumnRenamed("VALUE", row[1][:-4])
    #df = df.select(col("DATE").alias("DATE"), col("VALUE").alias(row[1][:-4]))
    dff = dff.join(df, ['DATE'], how='full')
    length -= 1
dff.write.save('/kaggle/working/whatever', format='parquet', mode='overwrite')
So to test it, I call df.show() after 3 columns are merged and it's quite fast. But when I try around 25 columns, it takes around 2 minutes. When I try 500 columns it's next to impossible.
I don't think I'm doing it right. The formatting and everything is correct. But why is it taking so long? How can I use PySpark properly? Are there any better libraries to achieve what I need?
Spark doesn't do anything magical compared to other software. The strength of Spark is parallel processing; most of the time that means you can use multiple machines to do the work. If you are running Spark locally you may hit the same issues you did when using pandas.
That being said, there might be a way for you to run it locally using Spark because it can spill to disk under certain conditions and does not need to have everything in memory.
I'm not well versed in PySpark, but the approach I'd take is the following (a rough sketch follows the list):
load all the files with a glob like you did: /kaggle/input/bf-csv-2/BF_csv_2/data/**/*.csv
Use the input_file_name function (from pyspark.sql.functions import input_file_name), which gives you the path of the source file for each record in your DF (df.select("date", "value", input_file_name().alias("filename")) or similar)
Parse the path into a format that I'd like to have as a column (e.g. extract the filename)
the schema should look like date, value, filename at this step
use the PySpark equivalent of df.groupBy("date").pivot("filename").agg(first("value")). Note: I used first() because I think you have at most one record per date/file combination
Also try: setting the number of partitions to be equal to the number of dates you have
If you want the output as a single file, do not forget to repartition(1) before df.write. This step might be problematic depending on data size. You do not need to do this if you plan to keep using Spark for your work, since you could load the data using the same approach as in step 1 (/new_result_data/*.csv)
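A rough PySpark sketch of the above (paths, column names and the filename regex are assumptions, and with ~130,000 files you will probably also need to raise spark.sql.pivotMaxValues):
from pyspark.sql.functions import input_file_name, regexp_extract, first
# load everything at once; each row remembers which file it came from
raw = (spark.read.csv("/kaggle/input/bf-csv-2/BF_csv_2/data/**/*.csv", header=True, inferSchema=True)
            .withColumn("filename", regexp_extract(input_file_name(), r"([^/]+)\.csv$", 1)))
# pivot so that each file becomes its own column, indexed by DATE
wide = raw.groupBy("DATE").pivot("filename").agg(first("VALUE"))
wide.write.parquet("/kaggle/working/whatever", mode="overwrite")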
I have a dictionary as follows:
my_dict = {'a':[12,15.2,52.1],'b':[2.5,2.4,5.2],'c':[1.2,5.3,12]}
I want to save this dictionary in Databricks so that I do not have to rebuild it every time I want to start working with it. Furthermore, I would like to know how to retrieve it and have it in its original form again.
I have tried doing the following:
from itertools import zip_longest
column_names, data = zip(*my_dict.items())
spark.createDataFrame(zip_longest(*data), column_names).show()
and
column_names, data = zip(*dict_brands.items())
spark.createDataFrame(zip(*data), column_names).show()
However, I get the following error:
zip_longest argument #10342 must support iteration
I also do not know how to reload it or upload it. I tried with a sample dataframe (not the same one), as follows:
df.write.format("tfrecords").mode("overwrite").save('/data/tmp/my_df')
And the error is:
Attribute name "my_column" contains invalid character(s)
among " ,;{}()\n\t=". Please use alias to rename it.
Finally, in order to obtain it, I thought about:
my_df = spark.table("my_df") # Get table
df = my_df.toPandas() # Make pd dataframe
and then turn it back into a dictionary, but maybe there is an easier way than converting it to a dataframe, saving it, reading it back as a dataframe and converting it into a dictionary again.
I would also like to know the computational cost for the solutions, since the actual dataset is very large.
Here is my sample code for realizing your needs step by step.
Convert a dictionary to a Pandas dataframe
my_dict = {'a':[12,15.2,52.1],'b':[2.5,2.4,5.2],'c':[1.2,5.3,12]}
import pandas as pd
pdf = pd.DataFrame(my_dict)
Convert a Pandas dataframe to a PySpark dataframe
df = spark.createDataFrame(pdf)
To save a PySpark dataframe to a file, use the parquet format. The tfrecords format is not supported here.
df.write.format("parquet").mode("overwrite").save('/data/tmp/my_df')
To load the saved file above as a PySpark dataframe.
df2 = spark.read.format("parquet").load('/data/tmp/my_df')
To convert a PySpark dataframe back to a dictionary.
my_dict2 = df2.toPandas().to_dict()
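Note that to_dict() defaults to a nested dict keyed by row index; if you want the original one-list-per-key shape back, to_dict('list') should do it (keeping in mind that row order is not guaranteed to be preserved across the parquet round trip):
my_dict2 = df2.toPandas().to_dict('list')  # dict of lists, matching the original structure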
The computational cost of the code above depends mainly on the memory needed for your actual dataset.
If you load some data, compute a DataFrame, write that to disk and then use the DataFrame later... assuming it isn't still cached in RAM (let's say there wasn't enough), would Spark be smart enough to load the data from disk rather than recompute the DataFrame from the original data?
For example:
df1 = spark.read.parquet('data/df1.parquet')
df2 = spark.read.parquet('data/df2.parquet')
joined = df1.join(df2, df1.id == df2.id)
joined.write.parquet('data/joined.parquet')
computed = joined.select('id').withColumn('double_total', 2 * joined.total)
computed.write.parquet('data/computed.parquet')
Under the right circumstances, when we store computed, will it load the joined DataFrame from data/joined.parquet or will it always re-compute by loading/joining df1/df2 if it isn't currently caching joined?
The joined dataframe points to df1.join(df2, df1.id == df2.id). As far as I know, the parquet writer does not change that reference, so in order to load the parquet data you need to construct a new reader with spark.read.parquet(...).
You can verify the above claim from the DataFrameWriter code (check the parquet/save methods), which returns Unit and does not modify the reference of the source dataframe in any way. Finally, to answer your question: in the above example the join will be calculated once for joined.write.parquet('data/joined.parquet') and once more for computed.write.parquet('data/computed.parquet').
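If you want to avoid that second computation, two common options are to cache joined before the writes, or to build computed from the parquet you just wrote instead of from the join (a sketch, reusing the paths from the question):
# option 1: keep the join result around explicitly
joined = df1.join(df2, df1.id == df2.id).cache()
joined.write.parquet('data/joined.parquet')
# option 2: derive "computed" from the data already on disk
joined_from_disk = spark.read.parquet('data/joined.parquet')
computed = joined_from_disk.withColumn('double_total', 2 * joined_from_disk.total).select('id', 'double_total')
computed.write.parquet('data/computed.parquet')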
I'm new to python, pandas, and hive and would definitely appreciate some tips.
I have the python code below, which I would like to turn into a UDF in hive. Only instead of taking a csv as the input, doing the transformations and then exporting another csv, I would like to take a hive table as the input, and then export the results as a new hive table containing the transformed data.
Python Code:
import pandas as pd
data = pd.read_csv('Input.csv')
df = data
df = df.set_index(['Field1','Field2'])
Dummies=pd.get_dummies(df['Field3']).reset_index()
df2=Dummies.drop_duplicates()
df3=df2.groupby(['Field1','Field2']).sum()
df3.to_csv('Output.csv')
You can make use of the TRANSFORM clause to run a UDF written in Python. The detailed steps are outlined here and here.
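As a rough illustration (table, column and script names here are placeholders, not taken from the question): the Python script reads tab-separated rows from stdin and writes tab-separated rows to stdout, and Hive streams the table through it:
ADD FILE /user/me/transform_dummies.py;
CREATE TABLE output_table AS
SELECT TRANSFORM (Field1, Field2, Field3)
       USING 'python transform_dummies.py'
       AS (Field1 STRING, Field2 STRING, dummy_counts STRING)
FROM input_table;
where transform_dummies.py is a minimal stdin/stdout skeleton like:
import sys
for line in sys.stdin:
    field1, field2, field3 = line.rstrip('\n').split('\t')
    # the pandas get_dummies / groupby logic would go here; this skeleton just echoes the fields
    print('\t'.join([field1, field2, field3]))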