Basically, I have CSV events in an S3 bucket (s3://csv_events/user=111/year=2020/month=07/, containing a number of CSV files). I want to convert these events to Parquet format and store the results in another S3 bucket (s3://parquet_events/user=111/year=2020/month=07/parquet_files).
My approach:
First, I created a Glue crawler to crawl the CSV events and create an Athena table (csv_events_table). Then I created a Glue job that takes csv_events_table as input, converts the events to Parquet, and stores the results in S3. Finally, I created another table for the Parquet events (parquet_events_table).
My approach is similar to this: https://www.powerupcloud.com/how-to-convert-historical-data-into-parquet-format-with-date-partitioning/
It works fine, but I end up with two Athena tables (csv_events_table and parquet_events_table).
Is there any way to access the S3 data directly in the Glue job and convert it to Parquet, so that I have only one Athena table (parquet_events_table)?
Please let me know.
Regards
-Siva
Absolutely. You can create a Spark session (my preference) or use the Glue context to read the data from S3 directly:
from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
df = spark.read.csv(s3path)  # s3path: the s3:// prefix of your CSV files
df.write.parquet(parquet_s3path)  # then write straight back out as Parquet
The use case is to append a column to a Parquet dataset and then re-write it efficiently at the same location. Here is a minimal example.
Create a pandas DataFrame and write as a partitioned Parquet dataset.
import pandas as pd
df = pd.DataFrame({
    'id': ['a', 'a', 'a', 'b', 'b', 'b', 'b', 'c', 'c'],
    'value': [0, 1, 2, 3, 4, 5, 6, 7, 8]})
path = r'c:/data.parquet'
df.to_parquet(path=path, engine='pyarrow', compression='snappy',
              index=False, partition_cols=['id'], flavor='spark')
Then load the Parquet dataset as a pyspark view and create a modified dataset as a pyspark DataFrame.
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
spark.read.parquet(path).createTempView('data')
sf = spark.sql("SELECT id, value, 0 AS segment FROM data")
At this point the sf data is the same as the df data, but with an additional segment column of all zeros. I would like to efficiently overwrite the existing Parquet dataset at path with sf as a Parquet dataset in the same location. Below is what does not work. I would also prefer not to write sf to a new location, delete the old Parquet dataset, and rename, as that does not seem efficient.
# saves existing data and new data
sf.write.partitionBy('id').mode('append').parquet(path)
# immediately deletes existing data then crashes
sf.write.partitionBy('id').mode('overwrite').parquet(path)
My answer in short: you shouldn't :\
One principle of big data (and Spark is for big data) is to never overwrite stuff. Sure, .mode('overwrite') exists, but this is not a correct usage.
My guesses as to why it could (and should) fail:
you add a column, so the written dataset has a different format than the one currently stored there. This can create schema confusion
you overwrite the input data while processing. So Spark reads some lines, processes them, and overwrites the input files. But those files are still the inputs for other lines yet to be processed.
What I usually do in such a situation is create another dataset, and when there is no reason to keep the old one (i.e. when the processing has completely finished), clean it up. To remove files, you can check this post on how to delete hdfs files. It should work for all files accessible by Spark. However, it is in Scala, so I'm not sure whether it can be adapted to pyspark.
Note that efficiency is not a good reason to overwrite; it does more work than simply writing.
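The write-to-a-new-location-then-clean-up pattern can be sketched with plain local paths (the directory names here are placeholders; on S3 or HDFS you would use the corresponding delete/rename APIs instead of os/shutil):

```python
import os
import shutil
import tempfile

base = tempfile.mkdtemp()
old_path = os.path.join(base, 'data.parquet')      # existing dataset
new_path = os.path.join(base, 'data_v2.parquet')   # rewritten dataset

os.makedirs(old_path)
# 1. Write the modified dataset to a NEW location, e.g. in Spark:
#    sf.write.partitionBy('id').parquet(new_path)
os.makedirs(new_path)

# 2. Only after the write has completely finished, drop the old dataset.
shutil.rmtree(old_path)

# 3. Optionally move the new dataset to the old name.
os.rename(new_path, old_path)
```

The key point is that the input is never touched while it is still being read.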
I have run a query using pyathena and created a pandas dataframe. Is there a way to write the pandas dataframe to an AWS Athena database directly?
Like data.to_sql for a MySQL database.
Sharing an example dataframe below for reference; it needs to be written into the AWS Athena database:
data = pd.DataFrame({'id': [1, 2, 3, 4, 5, 6], 'name': ['a', 'b', 'c', 'd', 'e', 'f'], 'score': [11, 22, 33, 44, 55, 66]})
Another modern (as of February 2020) way to achieve this goal is to use the aws-data-wrangler library. It automates many routine (and sometimes annoying) tasks in data processing.
Combining it with the case from the question, the code would look like this:
import pandas as pd
import awswrangler as wr
data=pd.DataFrame({'id':[1,2,3,4,5,6],'name':['a','b','c','d','e','f'],'score':[11,22,33,44,55,66]})
# Typical Pandas, Numpy or Pyarrow transformation HERE!
wr.pandas.to_parquet(  # Storing the data and metadata to the Data Lake
    dataframe=data,
    database="database",
    path="s3://your-s3-bucket/path/to/new/table",
    partition_cols=["name"],
)
This is amazingly helpful, because aws-data-wrangler knows how to parse the table name from the path (though you can also provide the table name as a parameter) and defines proper types in the Glue catalog according to the dataframe.
It is also helpful for querying data with Athena directly into a pandas dataframe:
df = wr.pandas.read_table(database="database", table="table")
The whole process is fast and convenient.
Storage for AWS Athena is S3, and it reads data from S3 files only. It was not possible earlier to write data directly to the Athena database like any other database; support for insert into ... was missing.
As a workaround, users could do the following steps to make it work:
1. Write the pandas output to a file.
2. Save the file to the S3 location that AWS Athena reads from.
I hope it gives you some pointers.
Update on 05/01/2020.
On Sep 19, 2019, AWS announced support for insert into Athena, which makes one of the statements in the answer above incorrect. The solution I provided above will still work, but the AWS announcement adds another possible solution going forward.
As the AWS documentation suggests, this feature allows you to send insert statements, and Athena writes the data back to a new file in the source table's S3 location. So essentially, AWS has resolved your headache of writing data back to S3 files.
Just a note: Athena will write the inserted data into separate files.
Here goes the documentation.
The top answer at the time of writing uses an older version of the API, which no longer works.
The documentation now outlines this round trip.
import awswrangler as wr
import pandas as pd
df = pd.DataFrame({"id": [1, 2], "value": ["foo", "boo"]})
# Storing data on Data Lake
wr.s3.to_parquet(
    df=df,
    path="s3://bucket/dataset/",
    dataset=True,
    database="my_db",
    table="my_table"
)
# Retrieving the data directly from Amazon S3
df = wr.s3.read_parquet("s3://bucket/dataset/", dataset=True)
# Retrieving the data from Amazon Athena
df = wr.athena.read_sql_query("SELECT * FROM my_table", database="my_db")
One option is to use:
pandas_df.to_parquet(file, engine="pyarrow")
to save it first to a temporary file in Parquet format. For this you need to install the pyarrow dependency. Once the file is saved locally, you can push it to S3 using the AWS SDK for Python (boto3).
A new table in Athena can then be created by executing the following query:
CREATE EXTERNAL TABLE IF NOT EXISTS your_new_table
(col1 type1, col2 type2)
PARTITIONED BY (col_partitions_if_necessary)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION 's3 location of your parquet file'
TBLPROPERTIES ("parquet.compression"="snappy");
Another option is to use pyathena for that. Taking the example from their official documentation:
import pandas as pd
from urllib.parse import quote_plus
from sqlalchemy import create_engine
conn_str = "awsathena+rest://:@athena.{region_name}.amazonaws.com:443/"\
           "{schema_name}?s3_staging_dir={s3_staging_dir}&s3_dir={s3_dir}&compression=snappy"
engine = create_engine(conn_str.format(
    region_name="us-west-2",
    schema_name="YOUR_SCHEMA",
    s3_staging_dir=quote_plus("s3://YOUR_S3_BUCKET/path/to/"),
    s3_dir=quote_plus("s3://YOUR_S3_BUCKET/path/to/")))
df = pd.DataFrame({"a": [1, 2, 3, 4, 5]})
df.to_sql("YOUR_TABLE", engine, schema="YOUR_SCHEMA", index=False, if_exists="replace", method="multi")
In this case, the dependency sqlalchemy is needed.
I'm currently using PyHive (Python 3.6) to read data from Hive to a server that exists outside the Hive cluster, and then using Python to perform analysis.
After performing the analysis I would like to write the data back to the Hive server.
In searching for a solution, most posts deal with using PySpark. In the long term we will set up our system to use PySpark. However, in the short term, is there a way to easily write data directly to a Hive table using Python from a server outside the cluster?
Thanks for your help!
You could use the subprocess module.
The following function works for data you've already saved locally. For example, if you save a dataframe to CSV, you can pass the name of the CSV file into save_to_hdfs, and it will put it in HDFS. I'm sure there's a way to upload the dataframe directly, but this should get you started.
Here's an example function for saving a local object, output, to user/<your_name>/<output_name> in hdfs.
import os
from subprocess import PIPE, Popen

def save_to_hdfs(output):
    """
    Save a file in local scope to hdfs.
    Note, this performs a forced put - any file with the same name will be
    overwritten.
    """
    hdfs_path = os.path.join(os.sep, 'user', '<your_name>', output)
    put = Popen(["hadoop", "fs", "-put", "-f", output, hdfs_path], stdin=PIPE, bufsize=-1)
    put.communicate()
# example
import pandas as pd
df = pd.DataFrame(...)
output_file = 'yourdata.csv'
df.to_csv(output_file)
save_to_hdfs(output_file)
# remove the locally created file (so it doesn't pollute nodes)
os.remove(output_file)
In which format do you want to write the data to Hive? Parquet/Avro/binary, or simple csv/text format?
Depending on the serde you used while creating the Hive table, different Python libraries can be used to first convert your dataframe to the respective serde format, store the file locally, and then use something like save_to_hdfs (as answered by @Jared Wilber below) to move that file into the HDFS Hive table location path.
When a Hive table is created (default or external table), it reads/stores its data from a specific HDFS location (default or provided). And this HDFS location can be accessed directly to modify data. Some things to remember if manually updating data in Hive tables: SERDE, PARTITIONS, ROW FORMAT DELIMITED, etc.
Some helpful serde libraries in Python:
Parquet: https://fastparquet.readthedocs.io/en/latest/
Avro: https://pypi.org/project/fastavro/
It took some digging, but I was able to find a method using sqlalchemy to create a Hive table directly from a pandas dataframe.
from sqlalchemy import create_engine

# Input information
host = 'username@local-host'
port = 10000
schema = 'hive_schema'
table = 'new_table'

# Execution
engine = create_engine(f'hive://{host}:{port}/{schema}')
engine.execute('CREATE TABLE ' + table + ' (col1 col1_type, col2 col2_type)')
df.to_sql(name=table, con=engine, if_exists='append')  # df is your pandas dataframe
You can write back.
Convert the data of df into a format where you insert multiple rows into the table at once, e.g. insert into table values (first row of the dataframe, comma separated), (second row), (third row), ... and so on; thus you can insert.
rows = df.values.tolist()
bundle = ', '.join('(' + ', '.join(repr(v) for v in row) + ')' for row in rows)
con.cursor().execute('insert into table_name values ' + bundle)
and you are done.
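As a concrete illustration of the bundling step, here is a self-contained sketch with plain Python rows (the table name is a placeholder; with a real dataframe the rows would come from df.values.tolist()):

```python
rows = [(1, 'a'), (2, 'b'), (3, 'c')]  # rows extracted from the dataframe

# Build "(v1, v2), (v1, v2), ..." - repr() quotes strings for SQL
bundle = ', '.join('(' + ', '.join(repr(v) for v in row) + ')' for row in rows)
sql = 'insert into table_name values ' + bundle

print(sql)  # insert into table_name values (1, 'a'), (2, 'b'), (3, 'c')
```

Note that repr()-based quoting is only a sketch; it does not escape values safely, so prefer parameterized queries for untrusted data.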
I'm new to Azure Databricks, so I'm having a hard time importing JSON data and converting it to CSV using Azure Databricks, even after reading the documentation.
After converting the JSON to CSV, I need to combine it with other CSV data that has a mutual column.
Any help would be really appreciated. Thank you
Are you looking to join on the mutual column? If so, you can do something like this:
dfjson = spark.read.json("/path/to/json")
dfcsv = spark.read.csv("/path/to/csv")
dfCombined = dfjson.join(dfcsv, dfjson.mutualCol == dfcsv.mutualCol, joinType)
dfCombined.write.format(someFormat).save("/path/to/output")
We have our own proprietary format for storing polygons and shapes in an image. I would like to use Spark to process this format. Is it possible to create my own reader in SparkContext to read the proprietary format and populate RDDs? I would like to create a derived class of existing RDDs that would be populated by my reader in SparkContext. I would like to do this in Python. Any suggestions or links are appreciated.
You should be able to simply read the data and convert it to an RDD using your SparkContext. You can then act on the data using Spark.
Example (in Scala; the PySpark approach is analogous):
import scala.io.Source

val sc = new SparkContext(sparkConf)
val result: RDD[MyCustomObject] =
  sc.parallelize(
    Source.fromFile("/tmp/DataFile.csv")
      .getLines()
      .drop(1)
      .map(x => MyCustomObject(x))
      .toSeq)
This applies to any format of data - although if you're able to read from HDFS, Cassandra, etc., the data will be read via a distributed read and presented as an RDD.
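Since the question asks for Python, here is a rough sketch of the same idea: parse the proprietary lines into Python objects, then hand them to Spark. MyCustomObject, the sample lines, and the parsing logic are placeholders for your actual format; only the commented sc.parallelize line needs a running Spark context:

```python
from collections import namedtuple

# Placeholder for the proprietary record type
MyCustomObject = namedtuple('MyCustomObject', ['shape_id', 'points'])

def parse_line(line):
    # Placeholder parsing logic for the proprietary format
    shape_id, points = line.split(';', 1)
    return MyCustomObject(shape_id, points)

# e.g. lines = open('/tmp/DataFile.fmt').read().splitlines()
lines = ["id;points", "s1;0,0 1,0 1,1", "s2;2,2 3,2"]
records = [parse_line(x) for x in lines[1:]]  # drop the header line

# In a Spark job, distribute the parsed records as an RDD:
# rdd = sc.parallelize(records)
```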