Convert JSON to CSV using Azure Databricks - python

I'm new to Azure Databricks, so even after reading the documentation I'm having a hard time importing JSON data and converting it to CSV.
After converting the JSON to CSV, I need to combine it with another CSV dataset that shares a mutual column.
Any help would be really appreciated. Thank you

Are you looking to join on the mutual column? If so, you can do something like this:
dfjson = spark.read.json("/path/to/json")
dfcsv = spark.read.csv("/path/to/csv")
dfCombined = dfjson.join(dfcsv, dfjson.mutualCol == dfcsv.mutualCol, joinType)  # joinType is a placeholder, e.g. "inner"
dfCombined.write.format(someFormat).save("/path/to/output")  # someFormat is a placeholder, e.g. "csv"
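If the end goal is specifically a CSV file, a minimal sketch of the write step (assuming dfCombined from the snippet above; the output path is a placeholder and the coalesce call is optional):
# Write the joined DataFrame out as CSV with a header row.
(
    dfCombined
    .coalesce(1)                 # optional: collapse to a single output file
    .write
    .option("header", True)
    .mode("overwrite")
    .csv("/path/to/output_csv")  # placeholder output directory
)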

Related

Which separator to use in pandas/dask read csv when importing the table from hive?

I am trying to import a table from Hive into Python (Jupyter) as a dask dataframe, but it is not getting imported properly. Can anyone help me with the right separator in this case? Please refer to the image.
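If the table was exported as Hive's default delimited text, the field separator is the Ctrl-A character ('\x01'). A minimal sketch assuming that format (the path and column names below are placeholders):
import dask.dataframe as dd

# Hive's default text delimiter is Ctrl-A ('\x01'); adjust sep if the table
# was created with a different FIELDS TERMINATED BY setting.
df = dd.read_csv(
    "/path/to/hive/warehouse/my_table/*",   # placeholder path to the exported files
    sep="\x01",
    header=None,                            # Hive text exports have no header row
    names=["col1", "col2", "col3"],         # placeholder column names
)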

Write pandas dataframe into AWS athena database

I have run a query using pyathena and have created a pandas dataframe. Is there a way to write the pandas dataframe to the AWS Athena database directly?
Something like data.to_sql for a MySQL database.
Sharing an example dataframe below for reference; this is what needs to be written into the AWS Athena database:
data = pd.DataFrame({'id': [1, 2, 3, 4, 5, 6], 'name': ['a', 'b', 'c', 'd', 'e', 'f'], 'score': [11, 22, 33, 44, 55, 66]})
Another modern (as of February 2020) way to achieve this goal is to use the aws-data-wrangler library. It automates many routine (and sometimes annoying) tasks in data processing.
Combining this with the case from the question, the code would look like this:
import pandas as pd
import awswrangler as wr
data=pd.DataFrame({'id':[1,2,3,4,5,6],'name':['a','b','c','d','e','f'],'score':[11,22,33,44,55,66]})
# Typical Pandas, Numpy or Pyarrow transformation HERE!
# Store the data and metadata in the Data Lake
wr.pandas.to_parquet(
    dataframe=data,
    database="database",
    path="s3://your-s3-bucket/path/to/new/table",
    partition_cols=["name"],
)
This is amazingly helpful, because aws-data-wrangler can parse the table name from the path (though you can also provide the table name as a parameter) and define the proper types in the Glue catalog according to the dataframe.
It is also helpful for querying the data with Athena directly into a pandas dataframe:
df = wr.pandas.read_table(database="database", table="table")
The whole process is fast and convenient.
Storage for AWS Athena is S3, and Athena reads data from S3 files only. It was not possible earlier to write data directly to the Athena database the way you would with any other database; support for insert into ... was missing.
As a workaround, you can do the following to make it work:
1. Write the pandas output to a file.
2. Save the file to the S3 location that AWS Athena reads from.
I hope it gives you some pointers.
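A minimal sketch of that two-step workaround, assuming boto3 credentials are configured; the bucket and key names are placeholders:
import boto3
import pandas as pd

data = pd.DataFrame({'id': [1, 2, 3], 'name': ['a', 'b', 'c'], 'score': [11, 22, 33]})

# Step 1: write the pandas output to a local file
data.to_csv("/tmp/data.csv", index=False)

# Step 2: upload it to the S3 prefix that the Athena table points at
s3 = boto3.client("s3")
s3.upload_file("/tmp/data.csv", "your-athena-bucket", "path/to/table/data.csv")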
Update on 05/01/2020.
On Sep 19, 2019, AWS announced support for INSERT INTO in Athena, which makes one of the statements in the answer above incorrect. The solution I provided will still work, but the AWS announcement adds another possible approach going forward.
As the AWS documentation suggests, this feature lets you send INSERT statements, and Athena writes the data back to a new file in the source table's S3 location. So essentially, AWS has resolved the headache of writing data back to S3 files for you.
Just a note, Athena will write inserted data into separate files.
Here goes the documentation.
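A minimal sketch of using that INSERT INTO support from Python via pyathena (the staging directory, region, table name, and columns are placeholders):
from pyathena import connect

conn = connect(
    s3_staging_dir="s3://your-athena-bucket/staging/",  # placeholder
    region_name="us-east-1",                            # placeholder
)
cursor = conn.cursor()
# Athena writes the inserted rows to new files under the table's S3 location.
cursor.execute("INSERT INTO my_table (id, name, score) VALUES (7, 'g', 77)")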
The top answer at the time of writing uses an older version of the API, which no longer works.
The documentation now outlines this round trip.
import awswrangler as wr
import pandas as pd
df = pd.DataFrame({"id": [1, 2], "value": ["foo", "boo"]})
# Storing data on Data Lake
wr.s3.to_parquet(
    df=df,
    path="s3://bucket/dataset/",
    dataset=True,
    database="my_db",
    table="my_table",
)
# Retrieving the data directly from Amazon S3
df = wr.s3.read_parquet("s3://bucket/dataset/", dataset=True)
# Retrieving the data from Amazon Athena
df = wr.athena.read_sql_query("SELECT * FROM my_table", database="my_db")
One option is to use:
pandas_df.to_parquet(file, engine="pyarrow")
to save it first to a temporary file in Parquet format. For this you need to install the pyarrow dependency. Once this file is saved locally, you can push it to S3 using the AWS SDK for Python (boto3).
A new table in Athena can now be created by executing the following query:
CREATE EXTERNAL TABLE IF NOT EXISTS `your_new_table`
  (col1 type1, col2 type2)
PARTITIONED BY (col_partitions_if_necessary)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
LOCATION 's3://your-s3-bucket/path/to/parquet/folder/'
TBLPROPERTIES ("parquet.compression"="snappy");
Another option is to use pyathena for that. Taking the example from their official documentation:
import pandas as pd
from urllib.parse import quote_plus
from sqlalchemy import create_engine
conn_str = "awsathena+rest://:#athena.{region_name}.amazonaws.com:443/"\
"{schema_name}?s3_staging_dir={s3_staging_dir}&s3_dir={s3_dir}&compression=snappy"
engine = create_engine(conn_str.format(
region_name="us-west-2",
schema_name="YOUR_SCHEMA",
s3_staging_dir=quote_plus("s3://YOUR_S3_BUCKET/path/to/"),
s3_dir=quote_plus("s3://YOUR_S3_BUCKET/path/to/")))
df = pd.DataFrame({"a": [1, 2, 3, 4, 5]})
df.to_sql("YOUR_TABLE", engine, schema="YOUR_SCHEMA", index=False, if_exists="replace", method="multi")
In this case, the dependency sqlalchemy is needed.

How to read an ORC file stored locally in Python Pandas?

Can I think of an ORC file as similar to a CSV file with column headings and row labels containing data? If so, can I somehow read it into a simple pandas dataframe? I am not that familiar with tools like Hadoop or Spark, but is it necessary to understand them just to see the contents of a local ORC file in Python?
The filename is someFile.snappy.orc
I can see online that spark.read.orc('someFile.snappy.orc') works, but even after importing pyspark, it throws an error.
I haven't been able to find any great options; there are a few dead projects trying to wrap the Java reader. However, pyarrow does have an ORC reader that won't require you to use pyspark. It's a bit limited, but it works.
import pandas as pd
import pyarrow.orc as orc
with open(filename, 'rb') as file:
    data = orc.ORCFile(file)
    df = data.read().to_pandas()
In case import pyarrow.orc as orc does not work (it did not work for me on Windows 10), you can read the file into a Spark DataFrame and then convert it to a pandas DataFrame:
import findspark
findspark.init()  # must run before importing pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df_spark = spark.read.orc('example.orc')
df_pandas = df_spark.toPandas()
Starting from pandas 1.0.0, there is a built-in read_orc function:
https://pandas.pydata.org/docs/reference/api/pandas.read_orc.html
import pandas as pd
import pyarrow.orc
df = pd.read_orc('/tmp/your_df.orc')
Be sure to read this warning about dependencies; this function might not work on Windows:
https://pandas.pydata.org/docs/getting_started/install.html#install-warn-orc
If you want to use read_orc(), it is highly recommended to install pyarrow using conda.
ORC, like AVRO and PARQUET, is a format specifically designed for massive storage. You can think of them "like a csv": they are all files containing data, just with their own particular structure (different from a csv, or a json of course!).
Reading an ORC file with pyspark should be easy, as long as your environment has Hive support.
Answering your question: I'm not sure that you will be able to read it in a local environment without Hive; I've never done it (you can do a quick test with the following code):
Loads ORC files, returning the result as a DataFrame.
Note: Currently ORC support is only available together with Hive support.
>>> df = spark.read.orc('python/test_support/sql/orc_partitioned')
Hive is a data warehouse system that lets you query your data on HDFS (a distributed file system) through MapReduce, much like a traditional relational database (the queries are SQL-like, but it does not support 100% of the standard SQL features!).
Edit: Try the following to create a new Spark session. Not to be rude, but I suggest you follow one of the many PySpark tutorials in order to understand the basics of this "world". Everything will be much clearer.
import findspark
findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('Test').getOrCreate()
The easiest way is to use pyorc:
import pyorc
import pandas as pd
with open(r"my_orc_file.orc", "rb") as orc_file:
reader = pyorc.Reader(orc_file)
orc_data = reader.read()
orc_schema = reader.schema
columns = list(orc_schema.fields)
df = pd.DataFrame(data=orc_data, columns=columns)
I did not want to submit a Spark job just to read local ORC files, or to depend on pandas. This worked for me.
import pyarrow.orc as orc
data_reader = orc.ORCFile("/path/to/orc/part_file.zstd.orc")
data = data_reader.read()
source = data.to_pydict()

How to derive from RDD and create your own?

We have our own proprietary format for storing polygon and shapes in an image. I would like to use Spark to process this format. Is it possible to create my own reader in SparkContext to read the proprietary format and populate RDDs? I would like to create a derived class of existing RDDs which would be populated by my reader in SparkContext. I would like to do this in Python. Any suggestions or links is appreciated.
You should be able to simply read the data and convert it to an RDD using your SparkContext. You can then operate on the data using Spark.
Example:
import scala.io.Source

val sc = new SparkContext(sparkConf)
val result: RDD[MyCustomObject] =
  sc.parallelize(
    Source.fromFile("/tmp/DataFile.csv")
      .getLines()
      .drop(1)                       // skip the header line
      .map(x => MyCustomObject(x))
      .toSeq
  )
This applies to any format of data - although if you're able to read from HDFS, Cassandra, etc. the data will be read via distributed read and presented as an RDD.
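Since the question asks for Python, here is a rough PySpark equivalent as a sketch (the path, the header handling, and the parse_record logic are placeholders for your proprietary polygon/shape format):
from pyspark import SparkContext

sc = SparkContext(appName="CustomFormatReader")

def parse_record(line):
    # Placeholder: replace with the parsing logic for your proprietary format.
    fields = line.split(",")
    return fields

lines = sc.textFile("/tmp/DataFile.csv")       # placeholder path
header = lines.first()
records = lines.filter(lambda l: l != header)  # drop the header line
rdd = records.map(parse_record)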

Hive UDF with Python

I'm new to python, pandas, and hive and would definitely appreciate some tips.
I have the python code below, which I would like to turn into a UDF in hive. Only instead of taking a csv as the input, doing the transformations and then exporting another csv, I would like to take a hive table as the input, and then export the results as a new hive table containing the transformed data.
Python Code:
import pandas as pd
data = pd.read_csv('Input.csv')
df = data
df = df.set_index(['Field1','Field2'])
Dummies=pd.get_dummies(df['Field3']).reset_index()
df2=Dummies.drop_duplicates()
df3=df2.groupby(['Field1','Field2']).sum()
df3.to_csv('Output.csv')
You can use Hive's TRANSFORM clause to call a UDF written in Python. The detailed steps are outlined here and here.
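A minimal sketch of the Python side of such a TRANSFORM-based UDF (the field names mirror the question and are placeholders). Hive streams each row to the script as tab-separated text on stdin and reads the transformed rows back from stdout; on the Hive side you would register the script with ADD FILE and invoke it via SELECT TRANSFORM(...) USING:
#!/usr/bin/env python
import sys

# Hive's TRANSFORM pipes each input row in as tab-separated text.
for line in sys.stdin:
    field1, field2, field3 = line.rstrip("\n").split("\t")
    # Placeholder transformation, loosely mirroring the get_dummies step:
    # emit a 0/1 indicator for whether Field3 is non-empty.
    indicator = "1" if field3 else "0"
    print("\t".join([field1, field2, indicator]))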
