Write pandas DataFrame into AWS Athena database - Python

I have run a query using pyathena and created a pandas DataFrame. Is there a way to write the pandas DataFrame directly to an AWS Athena database?
Like data.to_sql for a MySQL database.
Sharing an example of the DataFrame below for reference; this is what needs to be written to the AWS Athena database:
data = pd.DataFrame({'id': [1, 2, 3, 4, 5, 6], 'name': ['a', 'b', 'c', 'd', 'e', 'f'], 'score': [11, 22, 33, 44, 55, 66]})

Another modern (as of February 2020) way to achieve this is to use the aws-data-wrangler library. It automates many routine (and sometimes annoying) tasks in data processing.
Combined with the case from the question, the code would look like this:
import pandas as pd
import awswrangler as wr

data = pd.DataFrame({'id': [1, 2, 3, 4, 5, 6], 'name': ['a', 'b', 'c', 'd', 'e', 'f'], 'score': [11, 22, 33, 44, 55, 66]})

# Typical Pandas, Numpy or Pyarrow transformation HERE!

wr.pandas.to_parquet(  # Storing the data and metadata to the Data Lake
    dataframe=data,
    database="database",
    path="s3://your-s3-bucket/path/to/new/table",
    partition_cols=["name"],
)
This is amazingly helpful, because aws-data-wrangler knows how to parse the table name from the path (though you can also provide the table name as a parameter) and defines the proper types in the Glue catalog according to the DataFrame.
It is also helpful for querying data with Athena directly into a pandas DataFrame:
df = wr.pandas.read_table(database="database", table="table")
The whole process is fast and convenient.

Storage for AWS Athena is S3, and Athena reads data from S3 files only. It was not possible earlier to write data directly to the Athena database like any other database; it was missing support for insert into ....
As a workaround, users could follow these steps to make it work:
1. Write the pandas output to a file.
2. Save the file to the S3 location that AWS Athena reads from.
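A minimal sketch of those two steps, assuming boto3 is configured (bucket, key, and table location are placeholders):
import boto3
import pandas as pd

data = pd.DataFrame({'id': [1, 2, 3], 'name': ['a', 'b', 'c'], 'score': [11, 22, 33]})

# Step 1: write the pandas output to a local file.
data.to_csv('data.csv', index=False, header=False)

# Step 2: upload it to the S3 prefix that the Athena table reads from.
s3 = boto3.client('s3')
s3.upload_file('data.csv', 'your-s3-bucket', 'path/to/table/data.csv')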
I hope it gives you some pointers.
Update on 05/01/2020.
On Sep 19, 2019, AWS announced support for INSERT INTO in Athena, which makes one of the statements in the answer above incorrect. The solution I provided will still work, but the announcement adds another possible approach going forward.
As the AWS documentation suggests, this feature lets you send INSERT statements, and Athena writes the data back to a new file in the source table's S3 location. So essentially, AWS has resolved your headache of writing data back to S3 files.
Just a note, Athena will write inserted data into separate files.
Here is the documentation.
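For illustration, a rough sketch of issuing such an INSERT through pyathena (the database, table, staging bucket, and region are placeholders, and the table is assumed to already exist):
from pyathena import connect

# Placeholder staging location and region; Athena writes query results there.
conn = connect(
    s3_staging_dir="s3://your-s3-bucket/athena-results/",
    region_name="us-east-1",
)
cursor = conn.cursor()
# Athena writes the inserted rows as new files under the table's S3 location.
cursor.execute("INSERT INTO my_database.my_table (id, name, score) VALUES (7, 'g', 77)")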

The top answer at the time of writing uses an older version of the API, which no longer works.
The documentation now outlines this round trip.
import awswrangler as wr
import pandas as pd
df = pd.DataFrame({"id": [1, 2], "value": ["foo", "boo"]})
# Storing data on Data Lake
wr.s3.to_parquet(
    df=df,
    path="s3://bucket/dataset/",
    dataset=True,
    database="my_db",
    table="my_table",
)
# Retrieving the data directly from Amazon S3
df = wr.s3.read_parquet("s3://bucket/dataset/", dataset=True)
# Retrieving the data from Amazon Athena
df = wr.athena.read_sql_query("SELECT * FROM my_table", database="my_db")

One option is to use:
pandas_df.to_parquet(file, engine="pyarrow")
to first save it to a temporary file in Parquet format. For this you need to install the pyarrow dependency. Once the file is saved locally, you can push it to S3 using the AWS SDK for Python (boto3).
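A small sketch of that flow with boto3 (file, bucket, and key names are placeholders):
import boto3
import pandas as pd

pandas_df = pd.DataFrame({"a": [1, 2, 3, 4, 5]})

# Save to a temporary Parquet file locally (requires pyarrow).
pandas_df.to_parquet("tmp.parquet", engine="pyarrow")

# Push the file to the S3 prefix the Athena table will point to.
s3 = boto3.client("s3")
s3.upload_file("tmp.parquet", "YOUR_S3_BUCKET", "path/to/tmp.parquet")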
A new table in Athena can then be created by executing the following query:
CREATE EXTERNAL TABLE IF NOT EXISTS `your_new_table`
(col1 type1, col2 type2)
PARTITIONED BY (col_partitions_if_necessary)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
LOCATION 's3://location-of-your-parquet-files/'
TBLPROPERTIES ("parquet.compression"="SNAPPY");
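If the table is partitioned (with Hive-style key=value directories), Athena also needs to discover the partitions; one hedged option is to run MSCK REPAIR TABLE through boto3's Athena client (database, table, and output location are placeholders):
import boto3

athena = boto3.client("athena", region_name="us-west-2")
# Ask Athena to scan the table location and register any new partitions.
athena.start_query_execution(
    QueryString="MSCK REPAIR TABLE your_new_table",
    QueryExecutionContext={"Database": "YOUR_SCHEMA"},
    ResultConfiguration={"OutputLocation": "s3://YOUR_S3_BUCKET/query-results/"},
)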
Another option is to use pyathena for that. Taking the example from their official documentation:
import pandas as pd
from urllib.parse import quote_plus
from sqlalchemy import create_engine

conn_str = "awsathena+rest://:@athena.{region_name}.amazonaws.com:443/"\
           "{schema_name}?s3_staging_dir={s3_staging_dir}&s3_dir={s3_dir}&compression=snappy"
engine = create_engine(conn_str.format(
    region_name="us-west-2",
    schema_name="YOUR_SCHEMA",
    s3_staging_dir=quote_plus("s3://YOUR_S3_BUCKET/path/to/"),
    s3_dir=quote_plus("s3://YOUR_S3_BUCKET/path/to/")))

df = pd.DataFrame({"a": [1, 2, 3, 4, 5]})
df.to_sql("YOUR_TABLE", engine, schema="YOUR_SCHEMA", index=False, if_exists="replace", method="multi")
In this case, the dependency sqlalchemy is needed.

Related

How do I run a query in Cloud Spanner and download the results to a pandas DataFrame?

I have a tool that processes data in-memory with pandas DataFrames, and I'd like to be able to use Spanner as a data source for some of that processing. How can I use Python to run a query in Spanner and then download all the query results to a pandas DataFrame?
A quick and dirty way to get Spanner results into a pandas DataFrame:
import pandas as pd
from google.cloud import spanner

# Initialize client
client = spanner.Client()
# Get a Cloud Spanner instance by ID.
instance = client.instance('instance-name')
# Get a Cloud Spanner database by ID.
database = instance.database('database-name')

with database.snapshot() as snapshot:
    result = snapshot.execute_sql("SELECT * FROM somewhere")

    # Stream in rows
    rows = list()
    for row in result:
        rows.append(row)

    # Get column names
    cols = [x.name for x in result.fields]

# Convert to pandas dataframe
result_df = pd.DataFrame(rows, columns=cols)
This likely won't scale and you may run into issues with Spanner types vs pandas types, but it will solve the immediate problem of "I want to analyze data from Spanner in pandas."
The pandas library uses SQLAlchemy for SQL access, so we can follow this doc: https://cloud.google.com/spanner/docs/use-sqlalchemy
pip install sqlalchemy-spanner
Then, in Python code (if the SQLAlchemy version is >= 1.4):
import pandas as pd
url = 'spanner+spanner:///projects/project-id/instances/instance-id/databases/database-id'
sql = 'SELECT * FROM my_table;'
df = pd.read_sql(sql, url, index_col='id_column')
or in case it's sqlalchemy version 1.3:
...
url = 'spanner:///projects/project-id/instances/instance-id/databases/database-id'
...
To use Python to query your data in Cloud Spanner, you need to install and use the Python Cloud Spanner client library.
As of now there is no straightforward way to download data from Spanner into a pandas DataFrame.
I would suggest using the StreamedResultSet API to export your data to pandas.
Also, please take a look at this post about streaming data from Cloud Spanner to a pandas DataFrame, as it may prove helpful for implementing your use case as well.

Python replacement for SQL bcp.exe

The goal is to load a csv file into an Azure SQL database from Python directly, that is, not by calling bcp.exe. The csv files will have the same number of fields as do the destination tables. It'd be nice to not have to create the format file bcp.exe requires (xml for +-400 fields for each of 16 separate tables).
Following the Pythonic approach, try to insert the data and let SQL Server throw an exception if there is a type mismatch or other problem.
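A minimal sketch of that idea, assuming a SQLAlchemy engine like the one built in the answer below (the connection string, file, and table name are placeholders):
import pandas as pd
import sqlalchemy

# Placeholder engine; see the full connection-string example further down.
engine = sqlalchemy.create_engine("mssql+pyodbc:///?odbc_connect=<your-odbc-connection-string>")

df = pd.read_csv("yourfile.csv", header=None, names=["id", "name", "age"])
try:
    # Let SQL Server validate the data; a type mismatch raises an exception here.
    df.to_sql("your_table", engine, if_exists="append", index=False)
except sqlalchemy.exc.SQLAlchemyError as exc:
    print("Insert rejected by SQL Server:", exc)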
If you don't want to use the bcp command to import the CSV file, you can use the Python pandas library.
Here's an example where I import a headerless 'test9.csv' file from my computer into an Azure SQL database.
CSV file:
Python code example:
import pandas as pd
import sqlalchemy
import urllib
import pyodbc

# set up connection to database (with username/pw if needed)
params = urllib.parse.quote_plus("Driver={ODBC Driver 17 for SQL Server};Server=tcp:***.database.windows.net,1433;Database=Mydatabase;Uid=***@***;Pwd=***;Encrypt=yes;TrustServerCertificate=no;Connection Timeout=30;")
engine = sqlalchemy.create_engine("mssql+pyodbc:///?odbc_connect=%s" % params)

# read csv data to dataframe with pandas
# datatypes will be assumed
# pandas is smart but you can specify datatypes with the `dtype` parameter
df = pd.read_csv(r'C:\Users\leony\Desktop\test9.csv', header=None, names=['id', 'name', 'age'])

# write to sql table... pandas will use default column names and dtypes
df.to_sql('test9', engine, if_exists='append', index=False)
# add 'dtype' parameter to specify datatypes if needed; e.g. dtype={'column1': VARCHAR(255), 'column2': DateTime}
Notice:
Get the connection string from the Portal.
The UID format is like [username]@[servername].
Run the script and it works.
Please reference these documents:
HOW TO IMPORT DATA IN PYTHON
pandas.DataFrame.to_sql
Hope this helps.

Dataframe Koalas to Delta Table: ERROR: An error occurred while calling o237.save

I read a couple of CSV files using pandas from my driver node, converted the pandas DataFrame to a Koalas DataFrame, and finally I want to insert the data from Koalas into a Delta table, but I got an error:
import databricks.koalas as ks
import pandas as pd
import glob

all_files = glob.glob('/databricks/driver/myfolder/')
li = []
for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    li.append(df)

frame = pd.concat(li, axis=0, ignore_index=True)
df = ks.from_pandas(frame)
df.to_delta('dbfs:/FileStore/filesTest/%s' % tablename, mode='append')
ERROR: An error occurred while calling o237.save. :
java.lang.IllegalStateException: Cannot find the REPL id in Spark
local properties. Spark-submit and R doesn't support transactional
writes from different clusters. If you are using R, please switch to
Scala or Python. If you are using spark-submit , please convert it to
Databricks JAR job. Or you can disable multi-cluster writes by setting
'spark.databricks.delta.multiClusterWrites.enabled' to 'false'. If
this is disabled, writes to a single table must originate from a
single cluster. Please check
https://docs.databricks.com/delta/delta-intro.html#frequently-asked-questions-faq
for more details.
Delta Lake supports transactional writes from multiple clusters in the same workspace in Databricks Runtime 4.2 and above. All writers must be running Databricks Runtime 4.2 or above.
The following features are not supported when running in this mode:
SparkR
spark-submit jobs
Running a command using the REST API
Client-side encryption
Server-Side Encryption with Customer-Provided Encryption Keys
S3 paths with credentials in a cluster that cannot access AWS Security Token Service
Make sure:
If you are using R, switch to Scala or Python.
If you are using spark-submit, convert it to a Databricks JAR job.
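If all writes do come from a single cluster, the other option mentioned in the error message can be applied from a notebook; a minimal sketch (assuming a Databricks notebook where spark is already defined):
# Disable multi-cluster writes for Delta, as suggested in the error message.
# After this, writes to a given Delta table must originate from a single cluster.
spark.conf.set("spark.databricks.delta.multiClusterWrites.enabled", "false")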
Reference: "Delta Lake - Introductory Notebooks" and "Delta - FAQs".
Hope this helps.

Inserting a Python Dataframe into Hive from an external server

I'm currently using PyHive (Python 3.6) to read data from Hive to a server that exists outside the Hive cluster and then use Python to perform analysis.
After performing analysis I would like to write data back to the Hive server.
In searching for a solution, most posts deal with using PySpark. In the long term we will set up our system to use PySpark. However, in the short term is there a way to easily write data directly to a Hive table using Python from a server outside of the cluster?
Thanks for your help!
You could use the subprocess module.
The following function will work for data you've already saved locally. For example, if you save a dataframe to CSV, you can pass the name of the CSV into save_to_hdfs, and it will throw it in HDFS. I'm sure there's a way to throw the dataframe up directly, but this should get you started.
Here's an example function for saving a local object, output, to user/<your_name>/<output_name> in hdfs.
import os
import pandas as pd
from subprocess import PIPE, Popen

def save_to_hdfs(output):
    """
    Save a file in local scope to hdfs.

    Note, this performs a forced put - any file with the same name will be
    overwritten.
    """
    hdfs_path = os.path.join(os.sep, 'user', '<your_name>', output)
    put = Popen(["hadoop", "fs", "-put", "-f", output, hdfs_path], stdin=PIPE, bufsize=-1)
    put.communicate()

# example
df = pd.DataFrame(...)
output_file = 'yourdata.csv'
df.to_csv(output_file)
save_to_hdfs(output_file)

# remove locally created file (so it doesn't pollute nodes)
os.remove(output_file)
In which format do you want to write data to Hive? Parquet/Avro/binary, or a simple CSV/text format?
Depending on the SerDe you choose while creating the Hive table, different Python libraries can be used to first convert your dataframe to the respective format, store the file locally, and then use something like save_to_hdfs (as answered by @Jared Wilber) to move that file into the HDFS location of the Hive table (see the sketch after the list below).
When a Hive table is created (managed or external), it reads/stores its data from a specific HDFS location (default or provided). This HDFS location can be accessed directly to modify data. Some things to remember when manually updating data in Hive tables: SERDE, PARTITIONS, ROW FORMAT DELIMITED, etc.
Some helpful SerDe libraries in Python:
Parquet: https://fastparquet.readthedocs.io/en/latest/
Avro: https://pypi.org/project/fastavro/
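A rough sketch of that flow for the Parquet case (the dataframe, local file name, and HDFS table location are placeholders; the Hive table is assumed to use a Parquet SerDe):
import subprocess
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})

# Convert the dataframe to Parquet locally (requires fastparquet or pyarrow).
local_file = "data.parquet"
df.to_parquet(local_file, engine="fastparquet")

# Move the file into the Hive table's HDFS location (placeholder path).
subprocess.run(
    ["hadoop", "fs", "-put", "-f", local_file, "/user/hive/warehouse/my_table/"],
    check=True,
)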
It took some digging but I was able to find a method using sqlalchemy to create a hive table directly from a pandas dataframe.
from sqlalchemy import create_engine

# Input Information
host = 'username@local-host'
port = 10000
schema = 'hive_schema'
table = 'new_table'

# Execution
engine = create_engine(f'hive://{host}:{port}/{schema}')
engine.execute('CREATE TABLE ' + table + ' (col1 col1-type, col2 col2-type)')
# 'Data' is the pandas DataFrame you want to write
Data.to_sql(name=table, con=engine, if_exists='append')
You can write back.
Convert the data of df into a format that inserts multiple rows into the table at once, e.g. insert into table values (first row of dataframe, comma separated), (second row), (third row), ... and so on; thus you can insert.
bundle=df.assign(col='('+df[df.col[0]] + ','+df[df.col[1]] +...+df[df.col[n]]+')'+',').col.str.cat(' ')[:-1]
con.cursor().execute('insert into table table_name values'+ bundle)
and you are done.
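A more concrete sketch of that idea, assuming a PyHive connection and an existing table (the host, table name, and quoting helper are placeholders for illustration only):
import pandas as pd
from pyhive import hive

df = pd.DataFrame({"id": [1, 2], "name": ["a", "b"]})

def sql_literal(value):
    # Naive quoting for illustration only; not safe against SQL injection.
    if isinstance(value, str):
        return "'" + value.replace("'", "\\'") + "'"
    return str(value)

# Build one multi-row VALUES clause, e.g. (1, 'a'), (2, 'b')
values = ", ".join(
    "(" + ", ".join(sql_literal(v) for v in row) + ")"
    for row in df.itertuples(index=False, name=None)
)

conn = hive.Connection(host="your-hive-host", port=10000)
cursor = conn.cursor()
cursor.execute("INSERT INTO table_name VALUES " + values)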

How to read an ORC file stored locally in Python Pandas?

Can I think of an ORC file as similar to a CSV file with column headings and row labels containing data? If so, can I somehow read it into a simple pandas dataframe? I am not that familiar with tools like Hadoop or Spark, but is it necessary to understand them just to see the contents of a local ORC file in Python?
The filename is someFile.snappy.orc
I can see online that spark.read.orc('someFile.snappy.orc') works, but even after import pyspark, it throws an error.
I haven't been able to find any great options; there are a few dead projects trying to wrap the Java reader. However, pyarrow does have an ORC reader that doesn't require you to use pyspark. It's a bit limited, but it works.
import pandas as pd
import pyarrow.orc as orc

with open(filename, 'rb') as file:
    data = orc.ORCFile(file)
    df = data.read().to_pandas()
In case import pyarrow.orc as orc does not work (it did not work for me on Windows 10), you can read the file into a Spark DataFrame and then convert it to a pandas DataFrame:
import findspark
findspark.init()
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df_spark = spark.read.orc('example.orc')
df_pandas = df_spark.toPandas()
Starting from pandas 1.0.0, there is a built-in function for this in pandas.
https://pandas.pydata.org/docs/reference/api/pandas.read_orc.html
import pandas as pd
import pyarrow.orc
df = pd.read_orc('/tmp/your_df.orc')
Be sure to read this warning about dependencies; this function might not work on Windows:
https://pandas.pydata.org/docs/getting_started/install.html#install-warn-orc
If you want to use read_orc(), it is highly recommended to install pyarrow using conda.
ORC, like Avro and Parquet, is a format specifically designed for massive storage. You can think of them "like a CSV": they are all files containing data, each with their own particular structure (different from CSV, or from JSON of course!).
Reading an ORC file using pyspark should be easy, as long as your environment grants Hive support.
Answering your question: I'm not sure that you will be able to read it in a local environment without Hive, I've never done it (you can do a quick test with the following code):
Loads ORC files, returning the result as a DataFrame.
Note: Currently ORC support is only available together with Hive support.
>>> df = spark.read.orc('python/test_support/sql/orc_partitioned')
Hive is a data warehouse system that allows you to query your data on HDFS (a distributed file system) through MapReduce, like a traditional relational database (creating SQL-like queries; it doesn't support 100% of the standard SQL features!).
Edit: Try the following to create a new Spark session. Not to be rude, but I suggest you follow one of the many PySpark tutorials in order to understand the basics of this "world". Everything will be much clearer.
import findspark
findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('Test').getOrCreate()
The easiest way is to use pyorc:
import pyorc
import pandas as pd

with open(r"my_orc_file.orc", "rb") as orc_file:
    reader = pyorc.Reader(orc_file)
    orc_data = reader.read()
    orc_schema = reader.schema

columns = list(orc_schema.fields)
df = pd.DataFrame(data=orc_data, columns=columns)
I did not want to submit a Spark job to read local ORC files, or to have pandas installed. This worked for me.
import pyarrow.orc as orc
data_reader = orc.ORCFile("/path/to/orc/part_file.zstd.orc")
data = data_reader.read()
source = data.to_pydict()
