Data Ingestion: Load Dynamic Files from S3 to Snowflake - python

Situation: A CSV lands in AWS S3 every month. The vendor adds, removes, or modifies columns in the file as they please, so the schema is not known ahead of time. The requirement is to create a table on the fly in Snowflake and load the data into that table. Matillion is our ELT tool.
This is what I have done so far.
Set up a Lambda to detect the arrival of the file, convert it to JSON, upload it to another S3 directory, and add the filename to SQS.
Matillion detects the SQS message and loads the file with the JSON data into a VARIANT column in a Snowflake table.
A Snowflake stored procedure takes the VARIANT column and generates a table based on the number of fields in the JSON data. The VARIANT column in Snowflake only works this way for JSON data; CSV is sadly not supported.
This works with 10,000 rows. The problem arises when I run it with a full file, which is over 1 GB and more than 10M rows: the Lambda crashes at runtime with an out-of-disk-space error.
These are the alternatives I have thought of so far:
Attach an EFS volume to the Lambda and use it to store the JSON file prior to the upload to S3. JSON files are much larger than their CSV counterparts; since the file has over 10M rows, I expect the JSON to be around 10-20 GB.
Matillion has an Excel Query component that can take the headers, create a table on the fly, and load the file. I was thinking I could convert the header row from the CSV into an XLSX file within the Lambda, pass it over to Matillion, have it create the structure, and then load the CSV file once the structure is created.
What are my other options here? Considerations include a repeatable design pattern for future large CSVs or similar requirements, the cost of EFS, and whether I am making the best use of the tools available to me. Thanks!

Why not split the initial CSV file into multiple files and then process each file the same way you currently do?
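A minimal sketch of that splitting step as it might run inside the existing Lambda (the bucket, key, prefix, and chunk size are all hypothetical). It streams the object line by line so nothing is written to /tmp, repeats the header in every chunk, and assumes the CSV has no embedded newlines inside quoted fields.

import boto3

s3 = boto3.client("s3")

def split_csv(bucket, key, out_prefix, rows_per_chunk=500_000):
    # Stream the large CSV from S3 and re-upload it as smaller CSVs.
    body = s3.get_object(Bucket=bucket, Key=key)["Body"]
    lines = body.iter_lines()
    header = next(lines)
    chunk, part = [header], 0
    for line in lines:
        chunk.append(line)
        if len(chunk) - 1 >= rows_per_chunk:
            s3.put_object(Bucket=bucket,
                          Key=f"{out_prefix}/part_{part:05d}.csv",
                          Body=b"\n".join(chunk) + b"\n")
            chunk, part = [header], part + 1
    if len(chunk) > 1:
        s3.put_object(Bucket=bucket,
                      Key=f"{out_prefix}/part_{part:05d}.csv",
                      Body=b"\n".join(chunk) + b"\n")

# Hypothetical usage from the Lambda handler:
# split_csv("vendor-bucket", "incoming/data.csv", "incoming/split")

Each chunk can then go through the existing JSON conversion and SQS notification unchanged, and the per-invocation disk and memory footprint stays bounded by rows_per_chunk.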

Why are you converting CSV into JSON? CSV can be loaded directly into a table without the extra transformation JSON requires (the LATERAL FLATTEN to turn JSON into relational rows). And why not use Snowflake's Snowpipe feature to load the data directly into Snowflake without Matillion? You can also split large CSV files into smaller chunks before loading into Snowflake; this helps distribute the processing load across Snowflake warehouses.

I also load CSV files from SFTP into Snowflake, using Matillion, with no idea of the schema.
In my process, I create a "temp" table in Snowflake, with 50 VARCHAR columns (Our files should never exceed 50 columns). Our data always contains text, dates or numbers, so VARCHAR isn't a problem. I can then load the .csv file into the temp table. I believe this should work for files coming from S3 as well.
That will at least get the data into Snowflake. How to create the "final" table, however, given your scenario, I'm not sure.
I can imagine being able to use the header row, and/or doing some analysis on the 'type' of data contained in each column, to determine the column type needed.
But if you can get the 'final' table created, you could move the data over from temp. Or alter the temp table itself.
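A rough sketch of that type-inference idea in Python (the date formats and type names are my assumptions, not a prescription): sample some rows from the temp table, pick the narrowest Snowflake type that fits every value in each column, and build the DDL for the final table from that.

from datetime import datetime

def infer_type(values):
    # Return a Snowflake-style column type that fits every sampled value.
    def is_number(v):
        try:
            float(v)
            return True
        except ValueError:
            return False

    def is_date(v):
        for fmt in ("%Y-%m-%d", "%m/%d/%Y"):  # assumed date formats
            try:
                datetime.strptime(v, fmt)
                return True
            except ValueError:
                pass
        return False

    non_empty = [v for v in values if v not in ("", None)]
    if non_empty and all(is_number(v) for v in non_empty):
        return "NUMBER(38,10)"
    if non_empty and all(is_date(v) for v in non_empty):
        return "DATE"
    return "VARCHAR"

def build_ddl(table, headers, sample_rows):
    # headers: column names from the file; sample_rows: row tuples read from the temp table.
    cols = ", ".join(
        f'"{h}" {infer_type([row[i] for row in sample_rows])}'
        for i, h in enumerate(headers)
    )
    return f"CREATE TABLE {table} ({cols})"

# e.g. build_ddl("FINAL_TABLE", ["id", "amount", "billing_date"], sample_rows)

The generated statement can then be run from Matillion or a stored procedure before moving the data over from the temp table.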

This can be achieved using an external table mapped to a single column, with the delimiter set to a newline character. The external table also has a special virtual column that can be processed to extract all the columns dynamically; a stored procedure can then create a table based on the number of columns present at any given time. There is an interesting video that talks about this limitation in Snowflake (https://youtu.be/hRNu58E6Kmg).

Related

How to create a table in Redshift from multiple JSON files in S3

I wanted to know if it is possible to create a single table containing all the JSON files from an S3 bucket. I've searched a lot and can't find a solution for this; if anyone can help with any tips I'd appreciate it.
Yes, it is possible, but it is not clear what your intent is. If you have a bucket with a set of JSON files that are in a Redshift-readable format and share common data that can be mapped into columns, then this is fairly straightforward. The COPY command can read all the files in the bucket and apply a common mapping to the table's columns. Is this what you want?
Or do you have a bunch of dissimilar JSON files with various structures, from each of which you want to load some information into a Redshift table? Then you will likely want to use a Glue Crawler to inventory the JSON files, load them separately into Redshift, and then combine the common information into a single Redshift table.
Plus there are many other possibilities for what you need. The bottom line is that you are asking to load many unstructured files into a structured database. Some mapping needs to happen, but depending on what your data looks like this can be fairly simple or quite complex.
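For the first case (a common structure across the files), a minimal sketch of the COPY call issued from Python; the cluster endpoint, schema, table, bucket prefix, and IAM role are placeholders.

import psycopg2  # redshift_connector would work equally well

COPY_SQL = """
    COPY my_schema.my_table
    FROM 's3://my-bucket/json-prefix/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
    FORMAT AS JSON 'auto';
"""

# COPY reads every object under the prefix and maps JSON keys to column names ('auto').
conn = psycopg2.connect(host="my-cluster.example.redshift.amazonaws.com",
                        port=5439, dbname="dev", user="admin", password="...")
with conn, conn.cursor() as cur:
    cur.execute(COPY_SQL)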

Get Metadata from S3 parquet file using Pyarrow

I have a parquet file in S3 to which I will be automatically appending additional data every week. The data has timestamps at 5-minute intervals. I do not want to append any duplicate data during my updates, so what I am trying to accomplish is to read ONLY the max (most recent) timestamp within the data already saved in S3. Then I will make sure that all of the timestamps in the data I am appending are newer than that time before appending. I don't want to read the entire dataset from S3, in an effort to increase speed and preserve memory as the dataset continues to grow.
Here is an example of what I am doing now to read the entire file:
from pyarrow import fs
import pyarrow.parquet as pq

# Resolve the filesystem and bucket path from the S3 URI
s3, path = fs.S3FileSystem(access_key=access_key, secret_key=secret_key).from_uri(uri)
dataset = pq.ParquetDataset(path, filesystem=s3)
table = dataset.read()
But I am looking for something more like this (I am aware this isn't correct, but hopefully it conveys what I am attempting to accomplish):
max_date = pq.ParquetFile(path, filesystem=s3).metadata.row_group(0).column('timestamp').statistics['max']
I am pretty new to using both Pyarrow and AWS, so any help would be fantastic (including alternate solutions to my problem I described).
From a purely pedantic perspective I would phrase the problem statement a little differently as "I have a parquet dataset in S3 and will be appending new parquet files on a regular basis". I only mention that because the pyarrow documentation is written with that terminology in mind (e.g. you cannot append to a parquet file with pyarrow but you can append to a parquet dataset) and so it might help understanding.
The pyarrow datasets API doesn't have any operations to retrieve dataset statistics today (it might not be a bad idea to request the feature as a JIRA). However, it can help a little in finding your fragments. What you have doesn't seem that far off to me.
s3, path = fs.S3FileSystem(access_key=access_key, secret_key=secret_key).from_uri(uri)

# At this point a call will be made to S3 to list all the files
# in the directory 'path'
dataset = pq.ParquetDataset(path, filesystem=s3)

max_timestamp = None
for fragment in dataset.get_fragments():
    field_index = fragment.physical_schema.get_field_index('timestamp')
    # This will issue a call to S3 to load the metadata
    metadata = fragment.metadata
    for row_group_index in range(metadata.num_row_groups):
        stats = metadata.row_group(row_group_index).column(field_index).statistics
        # Parquet files can be created without statistics
        if stats:
            row_group_max = stats.max
            if max_timestamp is None or row_group_max > max_timestamp:
                max_timestamp = row_group_max
print(f"The maximum timestamp was {max_timestamp}")
I've annotated the places where actual calls to S3 will be made. This will certainly be faster than loading all of the data but there is still going to be some overhead which will grow as you add more files. This overhead could get quite high if you are running outside of the AWS region. You could mitigate this by scanning the fragments in parallel but that will be extra work.
It would be faster to store the max_timestamp in a dedicated statistics file whenever you update the data in your dataset. That way there is only ever one small file you need to read. If you're managing the writes yourself you might look into a table format like Apache Iceberg, which is a standard format for storing this kind of extra information and statistics about a dataset (what Arrow calls a "dataset" Iceberg calls a "table").
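A small sketch of that dedicated statistics file, reusing the s3 filesystem and path from the snippet above; the side-car file name _max_timestamp.json is just a convention I made up.

import json
from pyarrow.fs import FileType

STATS_KEY = f"{path}/_max_timestamp.json"  # hypothetical side-car object next to the data

def write_max_timestamp(max_timestamp):
    # Overwrite the side-car after every append so the next run reads only this one small object.
    with s3.open_output_stream(STATS_KEY) as out:
        out.write(json.dumps({"max_timestamp": str(max_timestamp)}).encode("utf-8"))

def read_max_timestamp():
    # Returns None if the side-car does not exist yet (e.g. the very first run).
    if s3.get_file_info(STATS_KEY).type == FileType.NotFound:
        return None
    with s3.open_input_stream(STATS_KEY) as src:
        return json.loads(src.read().decode("utf-8"))["max_timestamp"]

The trade-off is that the side-car can drift from the data if a writer forgets to update it, which is exactly the bookkeeping a table format like Iceberg handles for you.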

Best way to save an np.array or a python list object as a single record in BigQuery?

I have an ML model (text embedding) which outputs a large 1024-element vector of floats, which I want to persist in a BigQuery table.
The individual values in the vector don't mean anything on their own; the entire vector is the feature of interest. Hence, I want to store these lists in a single column in BigQuery as opposed to one column for each float. Additionally, adding 1024 extra columns to a table that originally has just 4 or 5 columns seems like a bad idea.
Is there a way of storing a Python list or an np.array in a column in BigQuery (maybe by converting them to JSON first, or something along those lines)?
Maybe it's not exactly what you were looking for, but the following options are the closest workarounds to what you're trying to achieve.
First of all, you can save your data locally in a CSV file with one column and then load that file into BigQuery. There are also other file formats that can be loaded into BigQuery from a local machine that might interest you. I personally would go with a CSV.
I did the experiment by creating an empty table in my dataset without adding a field. Then I used the code mentioned in the first link, after saving a column of my random data in a CSV file.
If you encounter the following error regarding the permissions, see this solution. It uses an authentication key instead.
google.api_core.exceptions.Forbidden: 403 GET https://bigquery.googleapis.com/bigquery/v2/projects/project-name/jobs/job-id?location=EU: Request had insufficient authentication scopes.
Also, you might find this link useful, in case you get the following error:
google.api_core.exceptions.BadRequest: 400 Provided Schema does not match Table my-project:my_dataset.random_data. Cannot add fields (field: double_field_0)
Besides loading your data from a local file, you can upload your data file to Google Cloud Storage and load the data from there. Many file formats are supported, such as Avro, Parquet, ORC, CSV, and newline-delimited JSON.
Finally, there is an option to stream the data directly into a BigQuery table using the API, but it is not available in the free tier.
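If you follow the CSV route but want the whole vector in a single cell, as the question asks, one possible variation is to serialize each vector to a JSON string and load it into a one-column table; the dataset/table name and the single STRING column here are assumptions.

import csv
import json
import numpy as np
from google.cloud import bigquery

vector = np.random.rand(1024)  # stand-in for the embedding model output

# Write a one-column CSV where the whole vector is a single JSON-encoded cell.
with open("embedding.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["embedding"])
    writer.writerow([json.dumps(vector.tolist())])

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    schema=[bigquery.SchemaField("embedding", "STRING")],
)
with open("embedding.csv", "rb") as f:
    client.load_table_from_file(f, "my_dataset.embeddings", job_config=job_config).result()

Reading a row back, json.loads(row.embedding) restores the original Python list.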

Data Visualization with hundreds of AWS billing Data CSV files when the header values change over time

I am developing a data visualization dashboard in Tableau with hundreds of CSV files in an AWS S3 bucket, and new files are generated every day.
In order to achieve this and make the process faster, I am loading the files into an AWS Redshift database. The CSV files sometimes have new columns, and sometimes previously existing columns are not present in the incoming files. To handle this I have modified my code to read and compare the headers; if new headers are present, it alters the table and adds the new columns.
However the issue I am facing is the following:
CSV file header values change over time. For example, if a column is currently named 'cost', the next month the 'cost' column might not be present and instead be mapped to a new column named 'Blended Cost'.
The COPY command to Redshift only works when the header positions match the column positions in the table. However, with such a dynamic file, matching column positions is not feasible. I'm exploring a DynamoDB option to overcome this issue.
What would be the best way to handle this situation? Any recommendation will be highly appreciated.
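A rough sketch of the header-comparison approach described above: read the incoming file's header, translate it through an alias map, add any missing columns, and pass an explicit column list to COPY so the file's field order (not the table's column order) drives the mapping. The alias map, table name, and IAM role are hypothetical.

import csv
import io
import boto3
import psycopg2

# Map vendor header names (which change over time) to stable table column names.
HEADER_ALIASES = {"cost": "cost", "Blended Cost": "cost", "Usage Type": "usage_type"}

def read_header(bucket, key):
    body = boto3.client("s3").get_object(Bucket=bucket, Key=key)["Body"]
    first_line = next(body.iter_lines()).decode("utf-8")
    return next(csv.reader(io.StringIO(first_line)))

def load_file(cur, bucket, key, table="billing_data"):
    header = read_header(bucket, key)
    cur.execute(
        "SELECT column_name FROM information_schema.columns WHERE table_name = %s",
        (table,),
    )
    known = {r[0] for r in cur.fetchall()}
    columns = []
    for h in header:
        col = HEADER_ALIASES.get(h, h.lower().replace(" ", "_"))
        columns.append(col)
        if col not in known:
            cur.execute(f'ALTER TABLE {table} ADD COLUMN "{col}" VARCHAR')
            known.add(col)
    col_list = ", ".join(f'"{c}"' for c in columns)
    cur.execute(f"""
        COPY {table} ({col_list})
        FROM 's3://{bucket}/{key}'
        IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
        CSV IGNOREHEADER 1;
    """)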

XML to Postgres via python/psycopg2

I have an existing Python script that loops through a directory of XML files, parsing each file using etree and inserting data at different points into a Postgres database schema using the psycopg2 module. This hacked-together script worked just fine, but now the amount of data (number and size of XML files) is growing rapidly, and the number of INSERT statements is just not scaling. The largest table in my final database has grown to about 50 million records from about 200,000 XML files. So my question is, what is the most efficient way to:
Parse data out of XMLs
Assemble row(s)
Insert row(s) to Postgres
Would it be faster to write all the data to CSV files in the correct format and then bulk load the final CSV files into Postgres using the COPY FROM command?
Alternatively, I was thinking about populating some sort of temporary data structure in memory that I could insert into the DB once it reaches a certain size, but I am having trouble arriving at the specifics of how this would work.
Thanks for any insight on this topic, and please let me know if more information is needed to answer my question.
copy_from is the fastest way I found to do bulk inserts. You might be able to get away with streaming the data through a generator, to avoid writing temporary files while keeping memory usage low.
A generator function could assemble rows out of the XML data, and copy_from could then consume that generator. You may even want multiple levels of generators, such that one yields records from a single file and another composes those across all 200,000 files. You'd end up with a single query, which will be much faster than 50,000,000 of them.
I wrote an answer here with links to example and benchmark code for setting something similar up.
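A minimal sketch of that generator-plus-copy_from pattern; the XML element names, table, columns, and connection string are hypothetical. It buffers the generator output into bounded StringIO chunks so memory stays flat while still avoiding per-row INSERTs.

import glob
import io
import os
import xml.etree.ElementTree as ET
import psycopg2

def rows_from_file(path):
    # Yield one tab-separated line per record; the element names are assumptions.
    tree = ET.parse(path)
    for rec in tree.iter("record"):
        yield "\t".join([
            rec.findtext("id", default=""),
            rec.findtext("name", default=""),
            rec.findtext("value", default=""),
        ]) + "\n"

def rows_from_dir(directory):
    # Chain the per-file generators across every XML file in the directory.
    for path in glob.glob(os.path.join(directory, "*.xml")):
        yield from rows_from_file(path)

def copy_in_batches(cur, lines, table, columns, batch_size=100_000):
    # Buffer generator output into StringIO chunks, then stream each chunk with copy_from.
    batch = []
    for line in lines:
        batch.append(line)
        if len(batch) >= batch_size:
            cur.copy_from(io.StringIO("".join(batch)), table, columns=columns)
            batch.clear()
    if batch:
        cur.copy_from(io.StringIO("".join(batch)), table, columns=columns)

conn = psycopg2.connect("dbname=mydb")  # hypothetical connection string
with conn, conn.cursor() as cur:
    copy_in_batches(cur, rows_from_dir("xml_files"), "my_table",
                    columns=("id", "name", "value"))

Wrapping the generator in a file-like object instead of batching would get this down to the single copy_from call described above, at the cost of a little more plumbing.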
