My use case is to append a column to a Parquet dataset and then re-write it efficiently at the same location. Here is a minimal example.
Create a pandas DataFrame and write it as a partitioned Parquet dataset.
import pandas as pd
df = pd.DataFrame({
'id': ['a','a','a','b','b','b','b','c','c'],
'value': [0,1,2,3,4,5,6,7,8]})
path = r'c:/data.parquet'
df.to_parquet(path=path, engine='pyarrow', compression='snappy', index=False, partition_cols=['id'], flavor='spark')
Then load the Parquet dataset as a pyspark view and create a modified dataset as a pyspark DataFrame.
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
spark.read.parquet(path).createTempView('data')
sf = spark.sql("SELECT id, value, 0 AS segment FROM data")
At this point the sf data is the same as the df data but with an additional segment column of all zeros. I would like to efficiently overwrite the existing Parquet dataset at path with sf, as a Parquet dataset in the same location. Below is what does not work. I would also prefer not to write sf to a new location, delete the old Parquet dataset, and rename it, as that does not seem efficient.
# saves existing data and new data
sf.write.partitionBy('id').mode('append').parquet(path)
# immediately deletes existing data then crashes
sf.write.partitionBy('id').mode('overwrite').parquet(path)
My answer in short: you shouldn't :\
One principle of big data (and Spark is for big data) is to never overwrite stuff. Sure, .mode('overwrite') exists, but it is not a correct usage here.
My guesses as to why it could (should) fail:
you add a column, so the written dataset has a different schema than the one currently stored there, which can cause schema confusion
you overwrite the input data while processing it, so Spark reads some rows, processes them and overwrites the input files, but those files are still the input for the remaining rows to process
What I usually do in such a situation is to create another dataset and, when there is no reason to keep the old one (i.e. when the processing is completely finished), clean it up (a pyspark sketch of this workflow follows the note below). To remove files, you can check this post on how to delete HDFS files; it should work for all files accessible by Spark. However, it is in Scala, so I'm not sure whether it can be adapted to pyspark.
Note that efficiency is not a good reason to overwrite; it does more work than simply writing.
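For illustration, here is a hedged sketch of that write-elsewhere-then-clean-up workflow using the question's sf and path; the temporary path name and the py4j access to Hadoop's FileSystem are my assumptions, not something from the original post:
new_path = 'c:/data_new.parquet'
sf.write.partitionBy('id').mode('overwrite').parquet(new_path)
# only after the write has fully succeeded, drop the old dataset
hadoop_conf = spark._jsc.hadoopConfiguration()
Path = spark._jvm.org.apache.hadoop.fs.Path
fs = Path(path).getFileSystem(hadoop_conf)
fs.delete(Path(path), True)  # True = recursive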
Related
I have a use case where I am reading data from a source into a dataframe, doing a groupBy on a field, and essentially breaking that dataframe into an array of dataframes.
My target state is to have all these dataframes written as individual CSV files in S3 (CSV because they need to be downloaded by the client and need to be human readable).
What's the best way of going about this?
I used this to split df into df_array : df_array = [(df.where(df[column_name] == i),i) for i in distinct_values]
And df.toPandas().to_csv(output_path + '.csv', index=False) individually on each dataframe to convert them to CSV files, but the challenges faced in this approach are:
My understanding is that since I require a single CSV file per grouping field, to_csv will bring data from all worker nodes to the driver and may cause a driver OOM issue.
I am unable to use Python multiprocessing to write the individual dataframes to S3, since the data is distributed on worker nodes and it gives me the error: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation.
No space left on device.
The pipeline is pretty slow as well. What is a better way to approach this use case?
[EDIT]
I want to control the name of the CSV file which gets created as well. The target state is 1 CSV file per group-by field (let's call it Name), so if there are 10 different Names in my initial df, the output will be 10 CSV files, each titled Name1.csv, Name2.csv, and so on.
As you're using pyspark, why don't you use repartition and partitionBy to achieve your goal?
df.repartition(1).write.partitionBy('grouping_field1', 'grouping_field2', ...).save('/path/to/save', format='csv')
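A minimal usage sketch of that line, assuming the group-by column is called Name and an S3 output path (both placeholders taken from the question's edit, not from the original answer):
(df.repartition(1)
   .write
   .partitionBy('Name')
   .mode('overwrite')
   .option('header', True)
   .csv('s3://my-bucket/output/'))
# each Name=<value> directory under the output path ends up holding a single part-*.csv file,
# which can then be renamed to <value>.csv if an exact filename is required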
I have a bunch of parquet files, each containing a subset of my dataset. Let's say that the files are named data-N.parquet with N being an integer.
I can read them all and subsequently convert to a pandas dataframe:
import glob
import pyarrow.parquet as pq

files = glob.glob("data-**.parquet")
ds = pq.ParquetDataset(
    files,
    metadata_nthreads=64,
).read_table(use_threads=True)
df = ds.to_pandas()
This works just fine. What I would like to have is an additional column in the final data frame, indicating which file the data originated from.
As far as I understand, the ds data is partitioned, with one partition per file. So it would be a matter of including the partition key in the data frame.
Is this feasible?
The partition key is, at the moment, included in the dataframe. However, all existing partitioning schemes derive the key from directory names. So if your data were laid out as /N/data.parquet or /batch=N/data.parquet this would work (you will need to supply a partitioning object when you read the dataset).
There is no way today (in pyarrow) to get the filename into the returned results.
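For example, here is a hedged sketch assuming the files were reorganised into hive-style directories such as data/batch=0/data.parquet, data/batch=1/data.parquet (the layout and the batch column name are assumptions):
import pyarrow.dataset as pads
dataset = pads.dataset("data/", format="parquet", partitioning="hive")
df = dataset.to_table().to_pandas()  # the directory key shows up as a regular 'batch' column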
As sort of a follow-on to my previous question [1], is there a way to open an hdf5 dataset in vaex, perform operations, and then store the results to the same dataset?
I tried the following:
import vaex as vx
vxframe = vx.open('somedata.hdf5')
vxframe = some_transformation(vxframe)
vxframe.export_hdf5('somedata.hdf5')
This results in the error OSError: Unable to create file (unable to truncate a file which is already open), so h5py can't write to the file while it is open. Is there another workflow to achieve this? I can write to another file as a workaround, but that seems quite inefficient as (I imagine) it has to copy all the data that has not changed as well.
[1] Convert large hdf5 dataset written via pandas/pytables to vaex
Copying to a new file would not be less efficient than writing to the same file (at least not for this example), since it has to write the same number of bytes. I also would not recommend writing in place, since if you make a mistake, you will mess up your data.
Exporting data is actually quite efficient, but even better, you can also choose to just export the columns you want:
import vaex

df = vaex.open('somedata.hdf5')
df2 = some_transformation(df)
df2[['new_column1', 'new_column2']].export('somedata_extra.hdf5')
...
# next time
df = vaex.open('somedata.hdf5')
df2 = vaex.open('somedata_extra.hdf5')
df = df.join(df2)  # joining without a column name merges on a row basis
We used this approach a lot to create auxiliary, precomputed datasets on disk. Joining them back (on a row basis) is instant; it does not take any time or memory.
If you load some data, compute a DataFrame, write that to disk and then use the DataFrame later... assuming it isn't still cached in RAM (let's say there wasn't enough), would Spark be smart enough to load the data from disk rather than recompute the DataFrame from the original data?
For example:
df1 = spark.read.parquet('data/df1.parquet')
df2 = spark.read.parquet('data/df2.parquet')
joined = df1.join(df2, df1.id == df2.id)
joined.write.parquet('data/joined.parquet')
computed = joined.select('id', 'total').withColumn('double_total', 2 * joined.total)
computed.write.parquet('data/computed.parquet')
Under the right circumstances, when we store computed, will it load the joined DataFrame from data/joined.parquet or will it always re-compute by loading/joining df1/df2 if it isn't currently caching joined?
The joined dataframe points to df1.join(df2, df1.id == df2.id). As far as I know, the parquet writer does not change that reference, so in order to load the parquet data you need to construct a new DataFrame with spark.read.parquet(...).
You can verify this claim from the DataFrameWriter code (check the parquet/save methods), which returns Unit and does not modify the reference of the source dataframe in any way. Finally, to answer your question: in the above example the joined dataframe will be calculated once for joined.write.parquet('data/joined.parquet') and once more for computed.write.parquet('data/computed.parquet').
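If recomputing the join is a concern, a hedged sketch of one workaround (not part of the original answer) is to read the materialised parquet back, or to cache joined before both writes; the column names below are taken from the question:
joined.write.parquet('data/joined.parquet')
# build the follow-up from the files on disk, so the second write scans parquet
# instead of re-running the join (alternatively, call joined.cache() before both writes)
joined_on_disk = spark.read.parquet('data/joined.parquet')
computed = joined_on_disk.select('id', 'total').withColumn('double_total', 2 * joined_on_disk.total)
computed.write.parquet('data/computed.parquet')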
I have a very simple csv file that looks like this:
time,is_boy,is_girl
135,1,0
136,0,1
137,0,1
I have this csv file sitting in a Hive table also, where all the values have been created as doubles in the table.
Behind the scenes, this table is actually enormous, and has an enormous number of rows, so I have chosen to use Spark 2 to solve this problem.
I would like to use this clustering library, with Python:
https://spark.apache.org/docs/2.2.0/ml-clustering.html
If anyone knows how to load this data, either directly from the csv or by using some Spark SQL magic, and preprocess it correctly, using Python, so that it can be passed into the kmeans fit() method and calculate a model, I would be very grateful. I also think it would be useful for others as I haven't found an example for csvs and for this library yet.
The fit method just takes a DataFrame with a vector column.
spark.read.csv or spark.sql both return a DataFrame.
However you want to preprocess your data, read over the DataFrame documentation before getting into the MLlib / KMeans examples.
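For the "directly from the csv" route mentioned in the question, a minimal hedged sketch (the file path is a placeholder and a SparkSession named spark is assumed) that can feed into the VectorAssembler / KMeans pipeline shown in the next answer:
from pyspark.sql.functions import col
df = spark.read.csv('data.csv', header=True, inferSchema=True)
df = df.select([col(c).cast('double') for c in df.columns])  # make sure every column is a double for KMeans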
So I guessed enough times and finally solved this. There were quite a few weird things I had to do to get it to work, so I feel it's worth sharing:
I created a simple csv like so:
time,is_boy,is_girl
123,1.0,0.0
132,1.0,0.0
135,0.0,1.0
139,0.0,1.0
140,1.0,0.0
Then I created a hive table, executing this query in hue:
CREATE EXTERNAL TABLE pollab02.experiment_raw(
`time` double,
`is_boy` double,
`is_girl` double)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' with
serdeproperties( 'separatorChar' = ',' )
STORED AS TEXTFILE LOCATION "/user/me/hive/experiment"
TBLPROPERTIES ("skip.header.line.count"="1", "skip.footer.line.count"="0")
Then my pyspark script was as follows:
(I'm assuming a SparkSession has been created with the name "spark")
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans
raw_data = spark.sql("select * from dbname.experiment_raw")
#filter out rows of null values that were added for some reason
raw_data_filtered=raw_data.filter(raw_data.time>-1)
#convert rows of strings to doubles for kmeans:
data=raw_data_filtered.select([col(c).cast("double") for c in raw_data_filtered.columns])
cols = data.columns
#Assemble all columns into a single vector column called 'features' in each row
vectorAss = VectorAssembler(inputCols=cols, outputCol="features")
vdf=vectorAss.transform(data)
kmeans = KMeans(k=2, maxIter=10, seed=1)
model = kmeans.fit(vdf)
and the rest is history. I haven't followed best practices here. We could maybe drop some columns that we don't need from the vdf DataFrame to save space and improve performance, but this works.
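As a hedged follow-up (not part of the original solution), the fitted model can be inspected and the cluster assignments attached like this:
predictions = model.transform(vdf)  # adds a 'prediction' column with the assigned cluster index
predictions.select('time', 'is_boy', 'is_girl', 'prediction').show(5)
print(model.clusterCenters())  # one center (a numpy array) per cluster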