AWS Glue ETL job missing fields visible to crawler - python

I have a table created by a crawler pointing to some parquet files stored in S3. From the Glue Data Catalog GUI I can see many fields (53).
When I open up an ETL dev endpoint, connect with a SageMaker notebook, load the same table, and run printSchema, I see far fewer fields (36) using the code below.
from pyspark.context import SparkContext
from awsglue.context import GlueContext

# Build a GlueContext on top of the existing SparkContext
glueContext = GlueContext(SparkContext.getOrCreate())

# Load the crawler-created table from the Data Catalog
data = glueContext.create_dynamic_frame.from_catalog(database = "my_database", table_name = "my_table")
print('Fields: ', len(data.schema().fields))
data.printSchema()
This returns only 36 fields. Can anyone tell me how to access the missing fields? It seems to happen most often with fields that are sparsely populated.
Edit: This unanswered question on the AWS forums seems to be due to the same issue - apparently PySpark tries to infer its own schema rather than use the one found by the crawler.

For parquet files, Glue uses Spark's reader and therefore relies on the schema read from the files themselves instead of the one the crawler stored in the Data Catalog.
If the source folder contains files with different schemas, the Glue crawler merges them into a single schema, which makes it different from the one you see in the ETL job.
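One way to see the merged view from a job is to bypass the catalog and let Spark merge the per-file schemas itself. A minimal sketch, assuming the files live under a hypothetical s3://my-bucket/my-table/ prefix:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# mergeSchema asks Spark to union the schemas of all parquet files it reads,
# instead of taking the schema from the first file it happens to open
df = spark.read.option("mergeSchema", "true").parquet("s3://my-bucket/my-table/")
print(len(df.schema.fields))
df.printSchema()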

Have you tried .create_dynamic_frame.from_options and reading directly from the S3 bucket? Sometimes that behaves differently from going through the crawler's table.
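For reference, a minimal sketch of that approach (the bucket path is hypothetical):

from pyspark.context import SparkContext
from awsglue.context import GlueContext

glueContext = GlueContext(SparkContext.getOrCreate())

# Read the parquet files straight from S3, skipping the catalog table
data = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-bucket/my-table/"], "recurse": True},
    format="parquet")
data.printSchema()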

Have you tried 'Update all new and existing partitions with metadata from the table' in 'Output -> Configuration options (optional)' section of Crawler?

Related

Pyspark: Incremental load, How to overwrite/update Hive table where data is being read

I'm currently writing a script for a daily incremental ETL. I used an initial load script to load base data to a Hive table. Thereafter, I created a daily incremental script that reads from the same table and uses that data.
Initially, I tried to "APPEND" the new data with the daily incremental script, but that seemed to create duplicate rows. So now I'm attempting to "OVERWRITE" the Hive table instead, which raises the exception below.
I noticed others with a similar issue, who want to read and overwrite the same table, have tried "refreshTable" before overwriting... I tried this solution as well, but I'm still receiving the same error.
Maybe I should refresh the table path as well?
-Thanks
The Error:
Caused by: java.io.FileNotFoundException: File does not exist: hdfs://essnp/sync/dev_42124_edw_b/EDW/SALES_MKTG360/RZ/FS_FLEET_ACCOUNT_LRF/Data/part-00000-4db6432b-f59c-4112-83c2-672140348454-c000.snappy.parquet
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
End of my code:
###### Loading the LRF table ######
spark.catalog.refreshTable(TABLE_NAME)
hive.write_sdf_as_parquet(spark,final_df_converted,TABLE_NAME,TABLE_PATH,mode='overwrite')
print("LOAD COMPLETED " + str(datetime.now()))
####### Ending SparkSession #######
spark.sparkContext.stop()
spark.stop()
It's not good practice to read from and write to the same path. Because of Spark's lazy DAG evaluation, the read and the write can effectively happen at the same time, so this error is expected.
It's better to read from one location and write to another.
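A minimal sketch of that pattern, assuming an active SparkSession named spark and the same TABLE_PATH variable from the question (the staging location is hypothetical): write the new result to a staging path first, and only then replace the target.

# Read the current data from the existing table path
current_df = spark.read.parquet(TABLE_PATH)

# ... apply the incremental logic to build final_df_converted ...

# Write the result to a separate staging path instead of overwriting the source directly
STAGING_PATH = TABLE_PATH + "_staging"   # hypothetical location
final_df_converted.write.mode("overwrite").parquet(STAGING_PATH)

# Re-read from staging and only then overwrite the original path,
# so the write never depends on files it is about to delete
spark.read.parquet(STAGING_PATH).write.mode("overwrite").parquet(TABLE_PATH)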

How can I create a table from a parquet file

Given a parquet file, how can I create the table associated with it in my Redshift database? The parquet file is snappy-compressed.
If you're dealing with multiple files, especially over a long term, then I think the best solution is to upload them to an S3 bucket and run a Glue crawler.
In addition to populating the Glue data catalog, you can also use this information to configure external tables for Redshift Spectrum, and create your on-cluster tables using create table as select.
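As a rough sketch of the Spectrum route (the schema, database, table, IAM role, and connection details below are all hypothetical placeholders), you can point an external schema at the Glue database the crawler populated and then materialise an on-cluster copy:

import psycopg2

# Connect to the Redshift cluster (connection details are placeholders)
conn = psycopg2.connect(host="my-cluster.example.redshift.amazonaws.com",
                        port=5439, dbname="dev", user="awsuser", password="my_password")
conn.autocommit = True
cur = conn.cursor()

# Expose the Glue database the crawler populated as an external schema
cur.execute("""
    CREATE EXTERNAL SCHEMA spectrum_schema
    FROM DATA CATALOG
    DATABASE 'my_glue_database'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'
""")

# Materialise an on-cluster table from the external (parquet-backed) table
cur.execute("CREATE TABLE local_copy AS SELECT * FROM spectrum_schema.my_table")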
If this is just a one-off task, then I've used parquet-tools in the past. The version that I've used is a Java library, but I see that there's also a version on PyPi.

Is there any way we can load BigTable data into BigQuery?

I want to load Bigtable data into BigQuery directly.
Until now I have been exporting Bigtable data to a CSV file using Python and then loading that CSV file into BigQuery.
But I don't want to use a CSV file as an intermediate step between Bigtable and BigQuery. Is there any direct way?
To add to Mikhail's recommendation, I'd suggest creating a permanent table in BigQuery using the external table. You'll define the schema for the columns you want and then query the rows you're interested in. Once that data is saved into BigQuery, it won't have any impact on your Bigtable performance. If you want to get the latest data, you can create a new permanent table with the same query.
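A rough sketch of that idea with the BigQuery Python client (the project, dataset, and table names are hypothetical, and it assumes an external table over Bigtable already exists; in practice you would select the specific column-family fields you defined rather than *):

from google.cloud import bigquery

client = bigquery.Client(project="my-project")

# Copy the rows of interest out of the Bigtable-backed external table into a
# normal BigQuery table, so later queries don't touch Bigtable at all
sql = """
    CREATE OR REPLACE TABLE my_dataset.bigtable_snapshot AS
    SELECT *
    FROM my_dataset.bigtable_external_table
"""
client.query(sql).result()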
If you're looking to have the data copied over and stored in BigQuery, Querying Cloud Bigtable data using permanent external tables is not what you're looking for. It explicitly mentions that "The data is not stored in the BigQuery table". My understanding is that the permanent table is more for persistent access controls, but still queries Bigtable directly.
This may be overkill, but you could set up an Apache Beam pipeline that runs in Dataflow, with a BigtableIO source and a BigQueryIO sink. You'd have to write a little bit of transformation logic, but overall it should be a pretty simple pipeline. The only catch is that the BigtableIO connector is only available in the Beam Java SDK, so you'd have to write this pipeline in Java.

AWS Glue crawler: different schema for input data

I have a subfolder in an S3 bucket to store CSV files. These CSV files all contain data from one specific data source. The data source provides a new CSV file monthly. I have about 4 years' worth of data.
At some point (~2 years ago) the data source decided to change the format of the data. The schema of the CSV changed (some columns were removed). The data is still more or less the same, and everything I want is still there.
I want to use a crawler to register both schemas, preferably in the same table. Ideally, I would like it to recognize the two versions of the schema.
How should I do that?
What I tried
I uploaded all the files in the subfolder and run a crawler with "Create a single schema for each S3 path" enabled.
Result: I got one table with both schemas merged: one big schema with all the columns from both formats
I uploaded all the files in the subfolder and run a crawler with "Create a single schema for each S3 path" disabled.
Result: I got two tables with the two distinct schemas
Why I need this
The two different schemas need to be processed differently. I'm writing a Python shell job to process the files. My idea was to use the catalog to pull the two different versions of the schema, and trigger different treatments for each file depending on the schema of the file.
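One way to do that dispatch in a Python shell job is to compare each file's header row against the column lists the crawler stored for the two tables. A rough sketch, with all database, table, bucket, and handler names hypothetical:

import csv
import boto3

glue = boto3.client("glue")
s3 = boto3.client("s3")

def catalog_columns(database, table):
    # Column names as the crawler recorded them for this table
    cols = glue.get_table(DatabaseName=database, Name=table)["Table"]["StorageDescriptor"]["Columns"]
    return {c["Name"] for c in cols}

old_schema = catalog_columns("my_database", "source_old_format")
new_schema = catalog_columns("my_database", "source_new_format")

def handle_old_format(bucket, key):
    pass   # placeholder: old-format processing

def handle_new_format(bucket, key):
    pass   # placeholder: new-format processing

def process(bucket, key):
    # Read just the first line of the CSV to decide which format it is
    first_line = next(s3.get_object(Bucket=bucket, Key=key)["Body"].iter_lines()).decode("utf-8")
    header = set(next(csv.reader([first_line])))
    if header == new_schema:
        handle_new_format(bucket, key)
    elif header == old_schema:
        handle_old_format(bucket, key)
    else:
        raise ValueError(f"Unrecognised schema in s3://{bucket}/{key}")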

How to use PySpark to stream data into MySQL database?

I am currently working on a single-page web app that allows users to upload large CSV files (currently testing a ~7GB file) to a Flask server and then stream that dataset to a database. The upload takes about a minute and the file gets fully saved to a temporary file on the Flask server. Now I need to be able to stream this file and store it into a database. I did some research and found that PySpark is great for streaming data, and I am choosing MySQL as the database to stream the CSV data into (but I am open to other databases and streaming methods).

I am a junior dev and new to PySpark, so I'm not sure how to go about this. The Spark streaming guide says that data must be ingested through a source like Kafka, Flume, TCP sockets, etc., so I am wondering if I have to use one of those methods to get my CSV file into Spark. However, I came across this great example where they stream CSV data into an Azure SQL database, and it looks like they just read the file directly with Spark without ingesting it through a streaming source like Kafka. The only thing that confuses me about that example is that they use an HDInsight Spark cluster to stream the data into the database, and I'm not sure how to tie all of that together with a Flask server.

I apologize for the lack of code, but currently I just have a Flask server file with one route doing the file upload. Any examples, tutorials, or advice would be greatly appreciated.
I am not sure about the streaming part, but Spark can handle large files efficiently, and storing to a db table will be done in parallel. So, without knowing much about your details, and provided that you have the uploaded file on your server, I'd say that:
If I wanted to save a big structured file like a CSV in a table, I would start like this:
# needed imports
from pyspark import SparkConf
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# start with some basic spark configuration, e.g. we want the timezone to be UTC
conf = SparkConf()
conf.set('spark.sql.session.timeZone', 'UTC')
# this is important: you need the MySQL Connector/J jar for the right MySQL version
# (download from https://dev.mysql.com/downloads/connector/j/)
conf.set('spark.jars', '/path/to/mysql-connector-java.jar')

# instantiate a spark session: the first time it will take a few seconds
spark = SparkSession.builder \
    .config(conf=conf) \
    .appName('Huge File uploader') \
    .getOrCreate()

# read the file first as a dataframe, using the first row as column names
df = spark.read.csv('path to 7GB/ huge csv file', header=True)

# optionally, add a filename column
df = df.withColumn('filename', F.lit('thecurrentfilename'))

# write it to the table
df.write.format('jdbc').options(
    url='jdbc:mysql://localhost:3306/your_database',
    driver='com.mysql.cj.jdbc.Driver',   # the JDBC driver class for MySQL
    dbtable='the table name to save to',
    user='user',
    password='secret',
).mode('append').save()
Note the 'append' mode here: the catch is that Spark cannot perform updates on a table; it can either append new rows or replace everything in the table.
So, if your csv is like this:
id, name, address....
You will end up with a table with the same fields.
This is the most basic example I could think of to get you started with Spark, with no considerations about a Spark cluster or anything else related. I would suggest you take it for a spin and figure out if it suits your needs :)
Also, keep in mind that this might take a few seconds or more depending on your data, where the database is located, your machine and your database load, so it might be a good idea to keep things asynchronous with your API; again, I don't know about any of your other details.
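For the asynchronous part, a minimal sketch (the /upload endpoint and the load_csv_to_mysql wrapper are hypothetical) that hands the Spark load off to a background worker so the upload request can return immediately:

from concurrent.futures import ThreadPoolExecutor
from flask import Flask, request

app = Flask(__name__)
executor = ThreadPoolExecutor(max_workers=1)   # one background worker for the Spark job

def load_csv_to_mysql(path):
    # hypothetical wrapper around the spark.read.csv(...).write.format('jdbc')... code above
    pass

@app.route('/upload', methods=['POST'])
def upload():
    uploaded = request.files['file']
    tmp_path = '/tmp/' + uploaded.filename   # temporary location on the Flask server
    uploaded.save(tmp_path)
    # kick off the Spark load in the background and return right away
    executor.submit(load_csv_to_mysql, tmp_path)
    return 'Upload accepted, loading started', 202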
Hope this helps. Good luck!
