We have our own proprietary format for storing polygons and shapes in an image. I would like to use Spark to process this format. Is it possible to create my own reader in SparkContext to read the proprietary format and populate RDDs? I would like to create a derived class of the existing RDDs which would be populated by my reader in SparkContext. I would like to do this in Python. Any suggestions or links are appreciated.
You should be able to simply read the data and convert it to an RDD using your Spark Context. You can then act on the data using Spark.
Example:
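import scala.io.Source
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

val sc = new SparkContext(sparkConf)
// Read the file on the driver, skip the header line, and distribute the parsed records.
val result: RDD[MyCustomObject] =
  sc.parallelize(
    Source.fromFile("/tmp/DataFile.csv")
      .getLines()
      .drop(1)
      .map(x => MyCustomObject(x))
      .toList)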
This applies to any format of data - although if you're able to read from HDFS, Cassandra, etc. the data will be read via distributed read and presented as an RDD.
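Since the question asks for Python, here is a minimal sketch of the same idea there. The real proprietary layout is unknown, so the parser below assumes a purely HYPOTHETICAL one: each polygon is a little-endian uint32 vertex count followed by that many (x, y) pairs of float64. The names parse_polygons and encode_polygon are illustrative only.

```python
import struct
from typing import Iterator, List, Tuple

Polygon = List[Tuple[float, float]]


def parse_polygons(data: bytes) -> Iterator[Polygon]:
    """Yield polygons from bytes stored in the HYPOTHETICAL layout described above."""
    offset = 0
    while offset < len(data):
        # Vertex count: one little-endian unsigned 32-bit integer.
        (count,) = struct.unpack_from("<I", data, offset)
        offset += 4
        # Then `count` vertices, each an (x, y) pair of 64-bit floats.
        coords = struct.unpack_from("<%dd" % (2 * count), data, offset)
        offset += 16 * count
        # Pair up the flat (x0, y0, x1, y1, ...) tuple into vertices.
        yield list(zip(coords[0::2], coords[1::2]))


def encode_polygon(polygon: Polygon) -> bytes:
    """Inverse of parse_polygons for one polygon; only used to build sample data."""
    flat = [value for vertex in polygon for value in vertex]
    return struct.pack("<I%dd" % len(flat), len(polygon), *flat)


sample = encode_polygon([(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)]) + encode_polygon([(2.0, 3.0)])
polygons = list(parse_polygons(sample))
```

With PySpark, sc.binaryFiles(path) returns (filename, bytes) pairs, so something like sc.binaryFiles("hdfs:///shapes/").flatMap(lambda kv: parse_polygons(kv[1])) should give an RDD of polygons without subclassing RDD, assuming each file fits in an executor's memory. The path is a placeholder.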
Related
Use case is to append a column to a Parquet dataset and then re-write efficiently at the same location. Here is a minimal example.
Create a pandas DataFrame and write as a partitioned Parquet dataset.
import pandas as pd
df = pd.DataFrame({
'id': ['a','a','a','b','b','b','b','c','c'],
'value': [0,1,2,3,4,5,6,7,8]})
path = r'c:/data.parquet'
df.to_parquet(path=path, engine='pyarrow', compression='snappy', index=False, partition_cols=['id'], flavor='spark')
Then load the Parquet dataset as a pyspark view and create a modified dataset as a pyspark DataFrame.
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
spark.read.parquet(path).createTempView('data')
sf = spark.sql(f"""SELECT id, value, 0 AS segment FROM data""")
At this point sf holds the same data as df, plus an additional segment column of all zeros. I would like to efficiently overwrite the existing Parquet dataset at path with sf as a Parquet dataset in the same location. Below is what does not work. I would also prefer not to write sf to a new location, delete the old Parquet dataset, and rename, as that does not seem efficient.
# saves existing data and new data
sf.write.partitionBy('id').mode('append').parquet(path)
# immediately deletes existing data then crashes
sf.write.partitionBy('id').mode('overwrite').parquet(path)
My answer in short: you shouldn't :\
One principle of big data (and Spark is for big data) is to never overwrite stuff. Sure, .mode('overwrite') exists, but this is not a correct usage of it.
My guesses as to why it could (should) fail:
You add a column, so the written dataset has a different schema than the one currently stored there. This can create schema confusion.
You overwrite the input data while processing it. Spark reads some rows, processes them, and overwrites the input files, but those files are still the input for rows that have not been processed yet.
What I usually do in such a situation is create another dataset and, when there is no reason to keep the old one (i.e. when the processing is completely finished), clean it up. To remove files, you can check this post on how to delete hdfs files. It should work for all files accessible by spark. However it is in scala, so I'm not sure if it can be adapted to pyspark.
Note that efficiency is not a good reason to overwrite: it does more work than simply writing.
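The "write elsewhere, then clean up" routine described above can be sketched as follows. This is only an illustration for a local filesystem (as in the c:/data.parquet example); on HDFS or S3 the rename and delete calls would be different. The name replace_dataset and the ".staging" / ".old" suffixes are made up for this sketch.

```python
import os
import shutil
import tempfile
from typing import Callable


def replace_dataset(path: str, write_new: Callable[[str], None]) -> None:
    """Write a new dataset next to `path`, then swap it in and delete the old one."""
    base = path.rstrip("/\\")
    staging, backup = base + ".staging", base + ".old"
    try:
        # The old dataset stays untouched (and readable) while the new one is written.
        write_new(staging)
    except Exception:
        shutil.rmtree(staging, ignore_errors=True)
        raise
    os.rename(base, backup)    # move the old dataset out of the way
    os.rename(staging, base)   # promote the new dataset
    shutil.rmtree(backup)      # clean up only after the swap succeeded


# Demo on a throwaway directory; write_new stands in for a Spark writer.
root = tempfile.mkdtemp()
dataset = os.path.join(root, "data.parquet")
os.makedirs(dataset)
open(os.path.join(dataset, "old.txt"), "w").close()


def write_new(dst: str) -> None:
    os.makedirs(dst)
    with open(os.path.join(dst, "new.txt"), "w") as handle:
        handle.write("new")


replace_dataset(dataset, write_new)
```

For the question's case the writer could be lambda dst: sf.write.partitionBy('id').parquet(dst), so Spark finishes reading the old files before they are removed.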
Basically, I have csv_events in my S3_bucket(s3://csv_events/user=111/year=2020/month=07/no.of.csv files). I want to convert these events into parquet format and want to store the results into another S3_bucket(s3://parquet_events/user=111/year=2020/month=07/parquet_files).
My Approach:
First, I created a Glue crawler to crawl csv_events and create an Athena table (csv_events_table). Then I created a Glue job that takes csv_events_table as input, converts those events to Parquet, and stores the results in S3. Finally, I created another table for the Parquet events (parquet_events_table).
My approach is similar to this: https://www.powerupcloud.com/how-to-convert-historical-data-into-parquet-format-with-date-partitioning/
It is working fine, but I end up with two Athena tables (csv_events_table, parquet_events_table).
Is there any way to read the S3 data directly in the Glue job and convert it to Parquet format, so that I have only one Athena table (parquet_events_table)?
Please let me know.
Regards
-Siva
Absolutely. You can create the Spark session (my preference) or use the Glue context to read data from S3 directly:
from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
df = spark.read.csv(s3path)
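The snippet above stops at reading the CSV. A minimal sketch of the remaining step, writing the result as Parquet, might look like this. The helper name convert_csv_to_parquet is hypothetical, header=True is an assumption about the CSV files, and "append" is just one possible write mode.

```python
from typing import Any, Sequence


def convert_csv_to_parquet(spark: Any, src: str, dst: str,
                           partition_cols: Sequence[str] = ()) -> None:
    """Read the CSV files under `src` and write them as Parquet under `dst`."""
    # header=True is an assumption; drop it if the files have no header row.
    df = spark.read.csv(src, header=True)
    writer = df.write.mode("append")
    if partition_cols:
        # Keep the key=value directory layout in the output.
        writer = writer.partitionBy(*partition_cols)
    writer.parquet(dst)
```

Inside the Glue job this could be called as convert_csv_to_parquet(spark, "s3://csv_events/", "s3://parquet_events/", ["user", "year", "month"]). Spark should pick up user, year and month as columns from the key=value directory names, so only the Parquet output needs a table.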
I'm currently using PyHive (Python3.6) to read data to a server that exists outside the Hive cluster and then use Python to perform analysis.
After performing analysis I would like to write data back to the Hive server.
In searching for a solution, most posts deal with using PySpark. In the long term we will set up our system to use PySpark. However, in the short term is there a way to easily write data directly to a Hive table using Python from a server outside of the cluster?
Thanks for your help!
You could use the subprocess module.
The following function will work for data you've already saved locally. For example, if you save a dataframe to csv, you can pass the name of the csv into save_to_hdfs, and it will put it in hdfs. I'm sure there's a way to upload the dataframe directly, but this should get you started.
Here's an example function for saving a local object, output, to /user/<your_name>/<output_name> in hdfs.
import os
from subprocess import PIPE, Popen

import pandas as pd


def save_to_hdfs(output):
    """
    Save a file in local scope to hdfs.
    Note, this performs a forced put - any file with the same name will be
    overwritten.
    """
    hdfs_path = os.path.join(os.sep, 'user', '<your_name>', output)
    put = Popen(["hadoop", "fs", "-put", "-f", output, hdfs_path], stdin=PIPE, bufsize=-1)
    put.communicate()


# example
df = pd.DataFrame(...)
output_file = 'yourdata.csv'
df.to_csv(output_file)
save_to_hdfs(output_file)
# remove locally created file (so it doesn't pollute nodes)
os.remove(output_file)
In which format do you want to write data to Hive? Parquet/Avro/binary or simple csv/text format?
Depending on the serde you chose when creating the Hive table, different Python libraries can be used to first convert your dataframe to the respective serde format and store the file locally. You can then use something like save_to_hdfs (as answered by @Jared Wilber below) to move that file into the Hive table's location path in hdfs.
When a hive table is created (default or external table), it reads/stores its data from a specific HDFS location (default or provided location). And this hdfs location can be directly accessed to modify data. Some things to remember if manually updating data in hive tables- SERDE, PARTITIONS, ROW FORMAT DELIMITED etc.
Some helpful serde libraries in python:
Parquet: https://fastparquet.readthedocs.io/en/latest/
Avro: https://pypi.org/project/fastavro/
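For a table stored as plain text, no extra library is needed. The sketch below writes rows in what I understand to be Hive's default text layout: fields separated by Ctrl-A (\x01), one row per line, NULL written as \N. It does not escape delimiters or newlines inside values, and the function name is made up for illustration.

```python
from typing import Iterable, Optional, Sequence

FIELD_DELIMITER = "\x01"  # Hive's default field separator (Ctrl-A)
NULL_MARKER = "\\N"       # Hive's default text representation of NULL


def rows_to_hive_text(rows: Iterable[Sequence[Optional[object]]],
                      delimiter: str = FIELD_DELIMITER) -> str:
    """Serialise rows as one delimited line per row, with None written as \\N."""
    lines = []
    for row in rows:
        fields = [NULL_MARKER if value is None else str(value) for value in row]
        lines.append(delimiter.join(fields))
    return "".join(line + "\n" for line in lines)
```

The returned text can be written to a local file and pushed into the table's HDFS location with save_to_hdfs. If the table was created with ROW FORMAT DELIMITED FIELDS TERMINATED BY ',', pass delimiter=',' instead.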
It took some digging but I was able to find a method using sqlalchemy to create a hive table directly from a pandas dataframe.
from sqlalchemy import create_engine
# Input information
host = 'username@local-host'
port = 10000
schema = 'hive_schema'
table = 'new_table'
# Execution
engine = create_engine(f'hive://{host}:{port}/{schema}')
engine.execute('CREATE TABLE ' + table + ' (col1 col1-type, col2 col2-type)')
# Data is the pandas DataFrame to upload
Data.to_sql(name=table, con=engine, if_exists='append')
You can write back.
Convert the rows of df into the form of a multi-row insert, e.g. insert into table values (first row of the dataframe, comma separated), (second row), (third row), and so on;
then you can insert them all in one statement.
bundle=df.assign(col='('+df[df.col[0]] + ','+df[df.col[1]] +...+df[df.col[n]]+')'+',').col.str.cat(' ')[:-1]
con.cursor().execute('insert into table table_name values'+ bundle)
and you are done.
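As a more explicit version of the same idea, here is a small sketch that builds the multi-row statement from plain Python rows. The escaping in sql_literal is deliberately simplified and is an assumption about what the target Hive version accepts, so treat it as a starting point rather than a safe quoting routine. Both function names are hypothetical.

```python
from typing import Iterable, Sequence


def sql_literal(value: object) -> str:
    """Render a Python value as a SQL literal (simplified escaping)."""
    if value is None:
        return "NULL"
    if isinstance(value, bool):
        return "TRUE" if value else "FALSE"
    if isinstance(value, (int, float)):
        return repr(value)
    # Assumption: backslash-escaping of quotes is accepted by the server.
    text = str(value).replace("\\", "\\\\").replace("'", "\\'")
    return "'" + text + "'"


def build_insert(table: str, rows: Iterable[Sequence[object]]) -> str:
    """Build one multi-row INSERT statement for the given rows."""
    tuples = ["(" + ", ".join(sql_literal(v) for v in row) + ")" for row in rows]
    return "INSERT INTO TABLE " + table + " VALUES " + ", ".join(tuples)
```

With a pandas DataFrame the rows could come from df.values.tolist(), and the resulting string would be passed to con.cursor().execute(...) as above.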
I have a huge list of GZip files which need to be converted to Parquet. Due to the compressing nature of GZip, this cannot be parallelized for one file.
However, since I have many, is there a relatively easy way to let every node do a part of the files? The files are on HDFS. I assume that I cannot use the RDD infrastructure for the writing of the Parquet files because this is all done on the driver as opposed to on the nodes themselves.
I could parallelize the list of file names and write a function that handles the Parquet files locally and saves them back to HDFS, but I wouldn't know how to do that. I feel like I'm missing something obvious, thanks!
This was marked as a duplicate question, but that is not the case. I am fully aware of the ability of Spark to read them in as RDDs without having to worry about the compression; my question is more about how to parallelize converting these files to structured Parquet files.
If I knew how to interact with Parquet files without Spark itself I could do something like this:
def convert_gzip_to_parquet(file_from, file_to):
    gzipped_csv = read_gzip_file(file_from)
    write_csv_to_parquet_on_hdfs(gzipped_csv, file_to)

# Filename RDD contains tuples with file_from and file_to.
# foreach is an action, so the conversions actually run (map alone is lazy).
filenameRDD.foreach(lambda x: convert_gzip_to_parquet(x[0], x[1]))
That would allow me to parallelize this, however I don't know how to interact with HDFS and Parquet from a local environment. I want to know either:
1) How to do that
Or..
2) How to parallelize this process in a different way using PySpark
I would suggest one of the two following approaches (where in practice I have found the first one to give better results in terms of performance).
Write each gzip file to a separate Parquet file
Here you can use pyarrow to write a Parquet file to HDFS:
import pyarrow.parquet as pq
from pyarrow import fs

def convert_gzip_to_parquet(file_from, file_to):
    gzipped_csv = read_gzip_file(file_from)
    pyarrow_table = to_pyarrow_table(gzipped_csv)
    # Older pyarrow releases exposed this client as pyarrow.HdfsClient().
    hdfs_client = fs.HadoopFileSystem("default")
    with hdfs_client.open_output_stream(file_to) as f:
        pq.write_table(pyarrow_table, f)

# Filename RDD contains tuples with file_from and file_to.
# foreach is an action, so the conversions actually run (map alone is lazy).
filenameRDD.foreach(lambda x: convert_gzip_to_parquet(x[0], x[1]))
There are two ways to obtain pyarrow.Table objects:
either obtain it from a pandas DataFrame (in which case you can also use pandas' read_csv() function): pyarrow_table = pyarrow.Table.from_pandas(pandas_df)
or manually construct it using pyarrow.Table.from_arrays
For pyarrow to work with HDFS one needs to set several environment variables correctly, see here
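The read_gzip_file helper is left undefined in the code above. One possible sketch, using only the standard library and assuming the files are gzip-compressed CSV with a header row and UTF-8 text, is shown below. It reads from a local path; for files on HDFS the file would have to be opened through an HDFS client instead.

```python
import csv
import gzip
import os
import tempfile
from typing import Dict, List


def read_gzip_file(path: str) -> List[Dict[str, str]]:
    """Read a gzip-compressed CSV file with a header row into a list of dicts."""
    with gzip.open(path, mode="rt", newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


# Build a small sample file to show the round trip.
sample_path = os.path.join(tempfile.mkdtemp(), "sample.csv.gz")
with gzip.open(sample_path, mode="wt", newline="", encoding="utf-8") as handle:
    handle.write("id,value\na,0\nb,1\n")

rows = read_gzip_file(sample_path)
```

All values come back as strings, so they would need casting before or after conversion. The list of dicts can be turned into a pandas DataFrame with pandas.DataFrame(rows) and from there into a pyarrow table with pyarrow.Table.from_pandas.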
Concatenate the rows from all gzip files into one Parquet file
def get_rows_from_gzip(file_from):
    rows = read_gzip_file(file_from)
    return rows

# read the rows of each gzip file into a list of Row objects
rows_rdd = filenameRDD.map(lambda x: get_rows_from_gzip(x[0]))
# flatten list of lists
rows_rdd = rows_rdd.flatMap(lambda x: x)
# convert to DataFrame and write to Parquet
df = spark_session.createDataFrame(rows_rdd)
df.write.parquet(file_to)
If you know the schema of the data in advance, passing a schema object to createDataFrame will speed up the creation of the DataFrame.
I'm new to python, pandas, and hive and would definitely appreciate some tips.
I have the python code below, which I would like to turn into a UDF in hive. Only instead of taking a csv as the input, doing the transformations and then exporting another csv, I would like to take a hive table as the input, and then export the results as a new hive table containing the transformed data.
Python Code:
import pandas as pd
data = pd.read_csv('Input.csv')
df = data
df = df.set_index(['Field1','Field2'])
Dummies=pd.get_dummies(df['Field3']).reset_index()
df2=Dummies.drop_duplicates()
df3=df2.groupby(['Field1','Field2']).sum()
df3.to_csv('Output.csv')
You can use Hive's TRANSFORM clause to run a UDF written in Python. The detailed steps are outlined here and here.
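As a rough sketch of what such a script could look like: Hive streams the selected columns to the script as tab-separated lines on stdin and reads tab-separated lines back from stdout. A streaming script cannot discover the distinct values of Field3 the way get_dummies does, so the list of categories below is a HYPOTHETICAL placeholder that would have to be supplied up front.

```python
from typing import Iterable, Iterator, Sequence

# HYPOTHETICAL list of values Field3 can take; replace with the real categories.
CATEGORIES: Sequence[str] = ("red", "green", "blue")


def transform(lines: Iterable[str],
              categories: Sequence[str] = CATEGORIES) -> Iterator[str]:
    """Turn 'Field1<TAB>Field2<TAB>Field3' rows into rows with 0/1 indicator columns."""
    for line in lines:
        field1, field2, field3 = line.rstrip("\n").split("\t")
        flags = ["1" if field3 == category else "0" for category in categories]
        yield "\t".join([field1, field2] + flags)


# When saved as a script for Hive (say dummies.py), finish with:
#     import sys
#     for row in transform(sys.stdin):
#         print(row)
```

In Hive this would be used roughly as ADD FILE dummies.py; followed by SELECT TRANSFORM(field1, field2, field3) USING 'python dummies.py' AS (field1, field2, red, green, blue) FROM input_table. The table and file names are placeholders. The drop_duplicates and groupby/sum steps from the pandas code can then be done in HiveQL with DISTINCT and GROUP BY, and the result stored with CREATE TABLE ... AS SELECT.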