Can I think of an ORC file as similar to a CSV file with column headings and row labels containing data? If so, can I somehow read it into a simple pandas dataframe? I am not that familiar with tools like Hadoop or Spark, but is it necessary to understand them just to see the contents of a local ORC file in Python?
The filename is someFile.snappy.orc
I can see online that spark.read.orc('someFile.snappy.orc') works, but even after import pyspark, it throws an error.
I haven't been able to find any great options; there are a few dead projects trying to wrap the Java reader. However, pyarrow does have an ORC reader that doesn't require you to use pyspark. It's a bit limited, but it works.
import pandas as pd
import pyarrow.orc as orc

# Open the file in binary mode so pyarrow can read it
with open(filename, 'rb') as file:
    data = orc.ORCFile(file)
    df = data.read().to_pandas()
In case import pyarrow.orc as orc does not work (it did not work for me on Windows 10), you can read the file into a Spark data frame and then convert it to a pandas data frame:
import findspark
findspark.init()

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df_spark = spark.read.orc('example.orc')
df_pandas = df_spark.toPandas()
Starting from pandas 1.0.0, there is a built-in function for this in pandas:
https://pandas.pydata.org/docs/reference/api/pandas.read_orc.html
import pandas as pd
import pyarrow.orc
df = pd.read_orc('/tmp/your_df.orc')
Be sure to read this warning about dependencies; this function might not work on Windows:
https://pandas.pydata.org/docs/getting_started/install.html#install-warn-orc
If you want to use read_orc(), it is highly recommended to install pyarrow using conda.
ORC, like AVRO and PARQUET, is a format specifically designed for massive storage. You can think of these files "like a CSV": they all contain data, each with its own particular structure (different from CSV, or from JSON, of course!).
Reading an ORC file with pyspark should be easy, as long as your environment has Hive support.
Answering your question: I'm not sure that you will be able to read it in a local environment without Hive, I've never done it (you can do a quick test with the following code):
Loads ORC files, returning the result as a DataFrame.
Note: Currently ORC support is only available together with Hive support.
>>> df = spark.read.orc('python/test_support/sql/orc_partitioned')
Hive is a data warehouse system that allows you to query your data on HDFS (a distributed file system) through MapReduce, much like a traditional relational database (you create SQL-like queries, though it doesn't support 100% of the standard SQL features!).
Edit: Try the following to create a new Spark session. Not to be rude, but I suggest you follow one of the many PySpark tutorials in order to understand the basics of this "world". Everything will be much clearer.
import findspark
findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('Test').getOrCreate()
The easiest way is to use pyorc:
import pyorc
import pandas as pd

with open(r"my_orc_file.orc", "rb") as orc_file:
    reader = pyorc.Reader(orc_file)
    orc_data = reader.read()
    orc_schema = reader.schema

columns = list(orc_schema.fields)
df = pd.DataFrame(data=orc_data, columns=columns)
I did not want to submit a Spark job or pull in pandas just to read local ORC files. This worked for me.
import pyarrow.orc as orc
data_reader = orc.ORCFile("/path/to/orc/part_file.zstd.orc")
data = data_reader.read()
source = data.to_pydict()
I have run a query using pyathena and have created a pandas dataframe. Is there a way to write the pandas dataframe to the AWS Athena database directly?
Like data.to_sql for a MySQL database.
Sharing an example of the dataframe code below for reference; it needs to be written into the AWS Athena database:
data=pd.DataFrame({'id':[1,2,3,4,5,6],'name':['a','b','c','d','e','f'],'score':[11,22,33,44,55,66]})
Another modern (as of February 2020) way to achieve this goal is to use the aws-data-wrangler library. It automates many routine (and sometimes annoying) tasks in data processing.
Combined with the case from the question, the code would look like this:
import pandas as pd
import awswrangler as wr
data=pd.DataFrame({'id':[1,2,3,4,5,6],'name':['a','b','c','d','e','f'],'score':[11,22,33,44,55,66]})
# Typical Pandas, Numpy or Pyarrow transformation HERE!
wr.pandas.to_parquet(  # Storing the data and metadata to the Data Lake
    dataframe=data,
    database="database",
    path="s3://your-s3-bucket/path/to/new/table",
    partition_cols=["name"],
)
This is amazingly helpful, because aws-data-wrangler knows how to parse the table name from the path (though you can also provide the table name as a parameter) and defines proper types in the Glue catalog according to the dataframe.
It is also helpful for querying the data with Athena directly into a pandas dataframe:
df = wr.pandas.read_table(database="database", table="table")
The whole process is fast and convenient.
Storage for AWS Athena is S3, and it reads data from S3 files only. It was not possible earlier to write data directly to the Athena database the way you would with any other database; support for INSERT INTO was missing.
As a workaround, users could follow these steps to make it work (a sketch follows the list):
1. Write the pandas output to a file.
2. Save the file to the S3 location that AWS Athena reads from.
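As an illustration, here is a minimal sketch of those two steps using boto3; the bucket, key, and local path are hypothetical placeholders:
import boto3
import pandas as pd

data = pd.DataFrame({'id': [1, 2, 3, 4, 5, 6], 'name': ['a', 'b', 'c', 'd', 'e', 'f'], 'score': [11, 22, 33, 44, 55, 66]})

# Step 1: write the pandas output to a local file
data.to_csv('/tmp/data.csv', index=False)

# Step 2: upload the file to the S3 prefix that the Athena table reads from
s3 = boto3.client('s3')
s3.upload_file('/tmp/data.csv', 'your-athena-bucket', 'path/to/table/data.csv')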
I hope it gives you some pointers.
Update on 05/01/2020.
On Sep 19, 2019, AWS announced support for INSERT INTO in Athena, which makes one of the statements in the answer above incorrect. The solution I provided still works, but the AWS announcement adds another possible solution going forward.
As the AWS documentation suggests, this feature allows you to send INSERT statements, and Athena writes the data back to a new file in the source table's S3 location. So essentially, AWS has resolved your headache of writing data back to S3 files.
Just a note, Athena will write inserted data into separate files.
Here goes the documentation.
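For example, assuming the target table already exists in Athena, an INSERT can now be sent through pyathena; the staging directory, region, table, and values below are hypothetical:
from pyathena import connect

# Hypothetical staging dir and region; adjust to your own setup
cursor = connect(
    s3_staging_dir="s3://YOUR_S3_BUCKET/staging/",
    region_name="us-west-2",
).cursor()

# Athena writes the inserted rows as new files under the table's S3 location
cursor.execute("INSERT INTO your_table (id, name, score) VALUES (1, 'a', 11)")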
The top answer at the time of writing uses an older version of the API, which no longer works.
The documentation now outlines this round trip.
import awswrangler as wr
import pandas as pd
df = pd.DataFrame({"id": [1, 2], "value": ["foo", "boo"]})
# Storing data on Data Lake
wr.s3.to_parquet(
    df=df,
    path="s3://bucket/dataset/",
    dataset=True,
    database="my_db",
    table="my_table"
)
# Retrieving the data directly from Amazon S3
df = wr.s3.read_parquet("s3://bucket/dataset/", dataset=True)
# Retrieving the data from Amazon Athena
df = wr.athena.read_sql_query("SELECT * FROM my_table", database="my_db")
One option is to use:
pandas_df.to_parquet(file, engine="pyarrow")
to save the dataframe first to a temporary file in Parquet format. For this you need to install the pyarrow dependency. Once this file is saved locally, you can push it to S3 using the AWS SDK for Python.
A new table in Athena can now be created by executing the following query:
CREATE EXTERNAL TABLE IF NOT EXISTS `your_new_table`
  (col1 type1, col2 type2)
PARTITIONED BY (col_partitions_if_necessary)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
LOCATION 's3 location of your parquet file'
tblproperties ("parquet.compression"="snappy");
Another option is to use pyathena for that. Taking the example from their official documentation:
import pandas as pd
from urllib.parse import quote_plus
from sqlalchemy import create_engine

conn_str = "awsathena+rest://:@athena.{region_name}.amazonaws.com:443/"\
           "{schema_name}?s3_staging_dir={s3_staging_dir}&s3_dir={s3_dir}&compression=snappy"
engine = create_engine(conn_str.format(
    region_name="us-west-2",
    schema_name="YOUR_SCHEMA",
    s3_staging_dir=quote_plus("s3://YOUR_S3_BUCKET/path/to/"),
    s3_dir=quote_plus("s3://YOUR_S3_BUCKET/path/to/")))

df = pd.DataFrame({"a": [1, 2, 3, 4, 5]})
df.to_sql("YOUR_TABLE", engine, schema="YOUR_SCHEMA", index=False, if_exists="replace", method="multi")
In this case, the dependency sqlalchemy is needed.
Currently I'm using the code below on Python 3.5, Windows to read in a parquet file.
import pandas as pd
parquetfilename = 'File1.parquet'
parquetFile = pd.read_parquet(parquetfilename, columns=['column1', 'column2'])
However, I'd like to do so without using pandas. What is the best way to do this? I'm using both Python 2.7 and 3.6 on Windows.
You can use duckdb for this. It's an embedded RDBMS similar to SQLite but with OLAP in mind. There's a nice Python API and a SQL function to import Parquet files:
import duckdb
conn = duckdb.connect(":memory:") # or a file name to persist the DB
# Keep in mind this doesn't support partitioned datasets,
# so you can only read one partition at a time
conn.execute("CREATE TABLE mydata AS SELECT * FROM parquet_scan('/path/to/mydata.parquet')")
# Export a query as CSV
conn.execute("COPY (SELECT * FROM mydata WHERE col = 'val') TO 'col_val.csv' WITH (HEADER 1, DELIMITER ',')")
How to read a modestly sized Parquet data-set into an in-memory Pandas DataFrame without setting up a cluster computing infrastructure such as Hadoop or Spark? This is only a moderate amount of data that I would like to read in-memory with a simple Python script on a laptop. The data does not reside on HDFS. It is either on the local file system or possibly in S3. I do not want to spin up and configure other services like Hadoop, Hive or Spark.
I thought Blaze/Odo would have made this possible: the Odo documentation mentions Parquet, but the examples all seem to go through an external Hive runtime.
pandas 0.21 introduces new functions for Parquet:
import pandas as pd
pd.read_parquet('example_pa.parquet', engine='pyarrow')
or
import pandas as pd
pd.read_parquet('example_fp.parquet', engine='fastparquet')
The above link explains:
These engines are very similar and should read/write nearly identical parquet format files. These libraries differ by having different underlying dependencies (fastparquet by using numba, while pyarrow uses a c-library).
Update: since the time I answered this, there has been a lot of work in this area; look at Apache Arrow for better reading and writing of parquet. Also: http://wesmckinney.com/blog/python-parquet-multithreading/
There is a python parquet reader that works relatively well: https://github.com/jcrobak/parquet-python
It will create Python objects, which you then have to move into a Pandas DataFrame, so the process will be slower than pd.read_csv, for example (a rough sketch follows).
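If I remember its interface correctly, a sketch looks like this; the file and column names are hypothetical, and the project README has the exact API:
import parquet  # the parquet-python package
import pandas as pd

# Read each row as a plain Python dict, then build a DataFrame from those rows
with open("example.parquet", "rb") as fo:
    rows = list(parquet.DictReader(fo, columns=["column1", "column2"]))

df = pd.DataFrame(rows)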
Aside from pandas, Apache pyarrow also provides a way to transform parquet into a dataframe.
The code is simple, just type:
import pyarrow.parquet as pq
df = pq.read_table(source=your_file_path).to_pandas()
For more information, see the document from Apache pyarrow Reading and Writing Single Files
Parquet
Step 1: Data to play with
import pandas as pd

df = pd.DataFrame({
    'student': ['personA007', 'personB', 'x', 'personD', 'personE'],
    'marks': [20, 10, 22, 21, 22],
})
Step 2: Save as Parquet
df.to_parquet('sample.parquet')
Step 3: Read from Parquet
df = pd.read_parquet('sample.parquet')
When writing to parquet, consider using brotli compression. I'm getting a 70% size reduction on an 8GB parquet file by using brotli compression. Brotli makes for a smaller file and faster reads/writes than gzip, snappy, or pickle. Although pickle can handle tuples whereas parquet cannot.
df.to_parquet('df.parquet.brotli', compression='brotli')
df = pd.read_parquet('df.parquet.brotli')
Parquet files can be quite large, so read them using dask:
import dask.dataframe as dd
from dask import delayed
from fastparquet import ParquetFile
import glob

files = glob.glob('data/*.parquet')

@delayed
def load_chunk(path):
    return ParquetFile(path).to_pandas()

df = dd.from_delayed([load_chunk(f) for f in files])
df.compute()
Consider the .parquet file named data.parquet:
parquet_file = '../data.parquet'
open(parquet_file, 'w+')
Convert to Parquet
Assuming one has a dataframe parquet_df that one wants to save to the parquet file above, one can use pandas.DataFrame.to_parquet (this function requires either the fastparquet or pyarrow library) as follows:
parquet_df.to_parquet(parquet_file)
Read from Parquet
In order to read the parquet file into a dataframe new_parquet_df, one can use pandas.read_parquet() as follows
new_parquet_df = pd.read_parquet(parquet_file)
I am using Spark 1.3.1 (PySpark) and I have generated a table using a SQL query. I now have an object that is a DataFrame. I want to export this DataFrame object (I have called it "table") to a csv file so I can manipulate it and plot the columns. How do I export the DataFrame "table" to a csv file?
Thanks!
If the data frame fits in driver memory and you want to save it to the local file system, you can convert the Spark DataFrame to a local Pandas DataFrame using the toPandas method and then simply use to_csv:
df.toPandas().to_csv('mycsv.csv')
Otherwise you can use spark-csv:
Spark 1.3
df.save('mycsv.csv', 'com.databricks.spark.csv')
Spark 1.4+
df.write.format('com.databricks.spark.csv').save('mycsv.csv')
In Spark 2.0+ you can use csv data source directly:
df.write.csv('mycsv.csv')
For Apache Spark 2+, in order to save the dataframe into a single csv file, use the following command:
query.repartition(1).write.csv("cc_out.csv", sep='|')
Here 1 indicates that I need only one partition for the csv. You can change it according to your requirements.
If you cannot use spark-csv, you can do the following:
df.rdd.map(lambda x: ",".join(map(str, x))).coalesce(1).saveAsTextFile("file.csv")
If you need to handle strings with line breaks or commas, that will not work. Use this:
import csv
import cStringIO

def row2csv(row):
    buffer = cStringIO.StringIO()
    writer = csv.writer(buffer)
    writer.writerow([str(s).encode("utf-8") for s in row])
    buffer.seek(0)
    return buffer.read().strip()

df.rdd.map(row2csv).coalesce(1).saveAsTextFile("file.csv")
You need to repartition the DataFrame into a single partition and then define the format, path, and other parameters for the file in Unix file system format, and here you go:
df.repartition(1).write.format('com.databricks.spark.csv').save("/path/to/file/myfile.csv",header = 'true')
Read more about the repartition function
Read more about the save function
However, repartition is a costly function and toPandas() is the worst. Try using .coalesce(1) instead of .repartition(1) in the previous syntax for better performance (see the example below).
Read more on repartition vs coalesce functions.
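Applying that suggestion to the earlier command, the write would look like this:
df.coalesce(1).write.format('com.databricks.spark.csv').save("/path/to/file/myfile.csv", header='true')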
Using PySpark
The easiest way to write to csv in Spark 3.0+:
sdf.write.csv("/path/to/csv/data.csv")
This can generate multiple files based on the number of partitions in the dataframe. In case you want it in a single file, use repartition:
sdf.repartition(1).write.csv("/path/to/csv/data.csv")
Using Pandas
If your data is not too large and can be held in local Python memory, then you can make use of pandas too:
sdf.toPandas().to_csv("/path/to/csv/data.csv", index=False)
Using Koalas
sdf.to_koalas().to_csv("/path/to/csv/data.csv", index=False)
How about this (in case you don't want a one-liner)?
for row in df.collect():
    d = row.asDict()
    s = "%d\t%s\t%s\n" % (d["int_column"], d["string_column"], d["string_column"])
    f.write(s)
f is an open file handle (a minimal setup is sketched below). Also, the separator is a TAB character, but it's easy to change it to whatever you want.
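For completeness, here is a minimal sketch of how f might be set up, using a hypothetical local output path and a context manager around the loop:
# Hypothetical output path; the loop above sits inside this block so the file is closed automatically
with open("/tmp/table_export.tsv", "w") as f:
    for row in df.collect():
        d = row.asDict()
        f.write("%d\t%s\t%s\n" % (d["int_column"], d["string_column"], d["string_column"]))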
I am late to the party, but this will let me rename the file, move it to a desired directory, and delete the unwanted additional directory Spark made:
import shutil
import os
import glob
path = 'test_write'
#write single csv
students.repartition(1).write.csv(path)
#rename and relocate the csv
shutil.move(glob.glob(os.getcwd() + '\\' + path + '\\' + r'*.csv')[0], os.getcwd()+ '\\' + path+ '.csv')
#remove additional directory
shutil.rmtree(os.getcwd()+'\\'+path)
Try display(df) and use the download option in the results. Please note: only 1 million rows can be downloaded with this option, but it's really quick.
I used the method with pandas and it gave me horrible performance. In the end it took so long that I stopped it and looked for another method.
If you are looking for a way to write to one csv instead of multiple csvs, this is what you are looking for:
df.coalesce(1).write.csv("train_dataset_processed", header=True)
It reduced the processing of my dataset from over 2 hours to 2 minutes.
I'm new to python, pandas, and hive and would definitely appreciate some tips.
I have the python code below, which I would like to turn into a UDF in hive. Only instead of taking a csv as the input, doing the transformations and then exporting another csv, I would like to take a hive table as the input, and then export the results as a new hive table containing the transformed data.
Python Code:
import pandas as pd
data = pd.read_csv('Input.csv')
df = data
df = df.set_index(['Field1','Field2'])
Dummies=pd.get_dummies(df['Field3']).reset_index()
df2=Dummies.drop_duplicates()
df3=df2.groupby(['Field1','Field2']).sum()
df3.to_csv('Output.csv')
You can use the TRANSFORM clause to call a UDF written in Python; the detailed steps are outlined here and here. A rough sketch of the idea follows.
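With TRANSFORM, Hive streams each row of the input table to your script as a tab-separated line on stdin and reads tab-separated lines back from stdout. A minimal sketch, with hypothetical field, table, and script names:
#!/usr/bin/env python
# Hypothetical streaming UDF for Hive's TRANSFORM clause; Hive would invoke it with something like:
#   ADD FILE my_udf.py;
#   SELECT TRANSFORM (field1, field2, field3) USING 'python my_udf.py' AS (field1, field2, field3)
#   FROM input_table;
import sys

for line in sys.stdin:
    # Each input row arrives as one tab-separated line
    field1, field2, field3 = line.rstrip('\n').split('\t')
    # ... apply the pandas-style transformation logic per row here ...
    # Write the output row back as a tab-separated line
    sys.stdout.write('\t'.join([field1, field2, field3]) + '\n')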