I have a parquet file in S3 to which I will be automatically appending additional data every week. The data has timestamps at 5-minute intervals. I do not want to append any duplicate data during my updates, so what I am trying to accomplish is to read ONLY the max/latest timestamp within the data saved in S3. Then I will make sure that all of the timestamps in the data I am appending are newer than that time before appending. I don't want to read the entire dataset from S3, in an effort to increase speed and preserve memory as the dataset continues to grow.
Here is an example of what I am doing now to read the entire file:
from pyarrow import fs
import pyarrow.parquet as pq
s3, path = fs.S3FileSystem(access_key, secret_key).from_uri(uri)
dataset = pq.ParquetDataset(path, filesystem=s3)
table = dataset.read()
But I am looking for something more like this (I am aware this isn't correct, but hopefully it conveys what I am attempting to accomplish):
max_date = pq.ParquetFile(path, filesystem=s3).metadata.row_group(0).column('timestamp').statistics['max']
I am pretty new to using both Pyarrow and AWS, so any help would be fantastic (including alternate solutions to my problem I described).
From a purely pedantic perspective I would phrase the problem statement a little differently as "I have a parquet dataset in S3 and will be appending new parquet files on a regular basis". I only mention that because the pyarrow documentation is written with that terminology in mind (e.g. you cannot append to a parquet file with pyarrow but you can append to a parquet dataset) and so it might help understanding.
The pyarrow datasets API doesn't have any operations to retrieve dataset statistics today (it might not be a bad idea to request the feature as a JIRA). However, it can help a little in finding your fragments. What you have doesn't seem that far off to me.
s3, path = fs.S3FileSystem(access_key, secret_key).from_uri(uri)

# At this point a call will be made to S3 to list all the files
# in the directory 'path'
dataset = pq.ParquetDataset(path, filesystem=s3)

max_timestamp = None
for fragment in dataset.get_fragments():
    field_index = fragment.physical_schema.get_field_index('timestamp')
    # This will issue a call to S3 to load the metadata
    metadata = fragment.metadata
    for row_group_index in range(metadata.num_row_groups):
        stats = metadata.row_group(row_group_index).column(field_index).statistics
        # Parquet files can be created without statistics
        if stats:
            row_group_max = stats.max
            if max_timestamp is None or row_group_max > max_timestamp:
                max_timestamp = row_group_max
print(f"The maximum timestamp was {max_timestamp}")
I've annotated the places where actual calls to S3 will be made. This will certainly be faster than loading all of the data, but there is still going to be some overhead, and it will grow as you add more files. This overhead could get quite high if you are running outside of the AWS region that hosts the bucket. You could mitigate it by scanning the fragments in parallel, but that will be extra work.
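If you do want to try the parallel scan, here is a rough sketch of what it could look like, reusing the dataset object from above; the worker count is arbitrary and the helper function name is my own.

from concurrent.futures import ThreadPoolExecutor

def fragment_max_timestamp(fragment):
    # One S3 metadata fetch per fragment, now done from a worker thread
    field_index = fragment.physical_schema.get_field_index('timestamp')
    metadata = fragment.metadata
    maxes = []
    for row_group_index in range(metadata.num_row_groups):
        stats = metadata.row_group(row_group_index).column(field_index).statistics
        if stats:
            maxes.append(stats.max)
    return max(maxes) if maxes else None

with ThreadPoolExecutor(max_workers=8) as pool:
    fragment_maxes = [m for m in pool.map(fragment_max_timestamp, dataset.get_fragments()) if m is not None]

max_timestamp = max(fragment_maxes) if fragment_maxes else None
print(f"The maximum timestamp was {max_timestamp}")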
It would be faster to store the max_timestamp in a dedicated statistics file whenever you update the data in your dataset. That way there is only ever one small file you need to read. If you're managing the writes yourself, you might look into a table format like Apache Iceberg, which is a standard format for storing this kind of extra information and statistics about a dataset (what Arrow calls a "dataset" Iceberg calls a "table").
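As a sketch of that statistics-file idea (not an established convention, just one way to do it), you could keep a tiny JSON sidecar next to the data, reusing the s3 filesystem and path from the snippet above; the _stats.json name is my own choice.

import json
from datetime import datetime

STATS_PATH = f"{path}/_stats.json"  # arbitrary sidecar location inside the dataset directory

def read_max_timestamp(s3):
    # Returns None on the very first run, when no sidecar exists yet
    try:
        with s3.open_input_stream(STATS_PATH) as f:
            return datetime.fromisoformat(json.loads(f.read().decode("utf-8"))["max_timestamp"])
    except OSError:
        return None

def write_max_timestamp(s3, max_timestamp):
    # Call this after every successful append so the sidecar stays current
    body = json.dumps({"max_timestamp": max_timestamp.isoformat()}).encode("utf-8")
    with s3.open_output_stream(STATS_PATH) as f:
        f.write(body)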
Related
So I have a lot of data in Azure blob storage. Each user can upload some cases, and the end result can be represented as a series of pandas dataframes. Now I want to be able to display some of this data on our site, but the files are several hundred MB and there is no need to download all of them. What would be the best way to get part of the dataframe?
I can make a folder structure in each blob storage container with the different columns of each dataframe, and perhaps a more compact summary of the columns, but I would like to keep it in one file if possible.
I could also set up a database containing the info, but I like the structure as it is - completely separated by case.
Originally I thought I could do it in hdf5 but it seems that I need to download the entire file from the blob storage to my API backend before I can run my python code on it. I would prefer if I could keep the hdf5 files and get the parts of the columns from the blob storage directly but as far as I can see that is not possible.
I am thinking this is something that has been solved a million times before but it is a bit out of my domain so I have not been able to find a good solution for it.
Check out the BlobClient of the Azure Python SDK. The download_blob method might suit your needs. Use chunks() to get an iterator that lets you iterate over the file in chunks. You can also set other parameters to ensure that a chunk doesn't exceed a set size.
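For example, a minimal sketch, assuming a connection string and made-up container/blob names; handle_chunk is a placeholder for your own processing:

from azure.storage.blob import BlobClient

blob = BlobClient.from_connection_string(
    conn_str="<connection-string>",
    container_name="cases",          # placeholder
    blob_name="user1/case42.h5",     # placeholder
)

# Stream the blob in chunks instead of holding it all in memory
downloader = blob.download_blob()
for chunk in downloader.chunks():
    handle_chunk(chunk)              # placeholder for your own processing

# Or read only a byte range, e.g. the first megabyte
partial = blob.download_blob(offset=0, length=1024 * 1024).readall()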
I'm working on a project that needs to update a CSV file with user info periodically. The CSV is stored in an S3 bucket, so I'm assuming I would use boto3 to do this. However, I'm not exactly sure how to go about it - would I need to download the CSV from S3 and then append to it, or is there a way to do it directly? Any code samples would be appreciated.
Ideally this would be something where DynamoDB would work pretty well (as long as you can create a hash key). Your current solution would require the following steps (a minimal boto3 sketch of them follows the list).
Download the CSV.
Append the new values to the CSV file.
Upload the CSV.
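Here is a minimal boto3 sketch of those three steps (the bucket, key and row format are placeholders; note that S3 has no true append, so the whole object is rewritten):

import csv
import io
import boto3

s3 = boto3.client("s3")
BUCKET, KEY = "my-bucket", "users.csv"   # placeholders

def append_rows(new_rows):
    # 1. Download the current CSV
    body = s3.get_object(Bucket=BUCKET, Key=KEY)["Body"].read().decode("utf-8")
    if body and not body.endswith("\n"):
        body += "\n"
    # 2. Append the new values in memory
    out = io.StringIO()
    out.write(body)
    csv.writer(out).writerows(new_rows)
    # 3. Upload the whole CSV again
    s3.put_object(Bucket=BUCKET, Key=KEY, Body=out.getvalue().encode("utf-8"))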
A big issue here is the possibility (not sure how this is planned) that the CSV is updated by more than one writer before the new version is uploaded, which would lead to data loss.
Using something like DynamoDB, you could have a table and just use the put_item API call to add new values as you see fit. Then, whenever you wish, you could write a Python script to scan all the values and write out a CSV file however you like!
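A rough sketch of that route with boto3 (the table name, hash key and attributes are invented for illustration):

import csv
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("user-info")   # assumes a table with hash key "user_id"

# Add a record whenever new user info arrives
table.put_item(Item={"user_id": "123", "name": "Ada", "updated": "2024-01-01"})

# Later, dump everything back out to a CSV; scan() paginates, so follow LastEvaluatedKey
with open("users.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["user_id", "name", "updated"], extrasaction="ignore")
    writer.writeheader()
    scan = table.scan()
    writer.writerows(scan["Items"])
    while "LastEvaluatedKey" in scan:
        scan = table.scan(ExclusiveStartKey=scan["LastEvaluatedKey"])
        writer.writerows(scan["Items"])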
I wanted to know if there is a possibility to create a single table containing all the JSON files from an S3 bucket. I've searched a lot and I can't find a solution for this, so if anyone can help with any tips I'd appreciate it.
Yes, it is possible, but it is not clear what your intent is. If you have a bucket with a set of JSON files that are in a Redshift-readable format and have common data that can be mapped into columns, then this is fairly straightforward. The COPY command can read all the files in the bucket and apply a common mapping to the table's columns. Is this what you want?
Or do you have a bunch of dissimilar JSON files in various structures, from which you want to load some information into a Redshift table? Then you will likely want to use a Glue Crawler to inventory the JSON files, load them separately into Redshift, and then combine the common information into a single Redshift table.
There are many other possibilities for what you might need. The bottom line is that you are asking to load many unstructured files into a structured database. Some mapping needs to happen, but depending on what your data looks like this can be fairly simple or quite complex.
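For the first (straightforward) case, a hedged sketch of what the COPY route could look like, issued here through the Redshift Data API with boto3; the cluster, database, table, IAM role and S3 path are all placeholders:

import boto3

redshift = boto3.client("redshift-data")

copy_sql = """
COPY analytics.events
FROM 's3://my-bucket/json-files/'
IAM_ROLE 'arn:aws:iam::111122223333:role/my-redshift-role'
FORMAT AS JSON 'auto';
"""

redshift.execute_statement(
    ClusterIdentifier="my-cluster",   # placeholder
    Database="dev",                   # placeholder
    DbUser="awsuser",                 # placeholder
    Sql=copy_sql,
)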
Situation: a CSV lands in AWS S3 every month. The vendor adds/removes/modifies columns in the file as they please, so the schema is not known ahead of time. The requirement is to create a table on the fly in Snowflake and load the data into said table. Matillion is our ELT tool.
This is what I have done so far.
Set up a Lambda to detect the arrival of the file, convert it to JSON, upload it to another S3 directory, and add the filename to an SQS queue.
Matillion detects the SQS message and loads the file with the JSON data into a VARIANT column in a Snowflake table.
A Snowflake stored procedure takes the VARIANT column and generates a table based on the number of fields in the JSON data. The VARIANT column in Snowflake only works this way if it's JSON data; CSV is sadly not supported.
This works with 10,000 rows. The problem arises when I run it with a full file, which is over 1 GB and more than 10M rows. It crashes the Lambda job with an out-of-disk-space error at runtime.
These are the alternatives I have thought of so far:
Attach an EFS volume to the Lambda and use it to store the JSON file prior to the upload to S3. JSON data files are so much larger than their CSV counterparts that I expect the JSON file to be around 10-20 GB, since the file has over 10M rows.
Matillion has an Excel Query component that can take the headers, create a table on the fly, and load the file. I was thinking I could convert the header row from the CSV into an XLSX file within the Lambda, pass it over to Matillion, have it create the structures, and then load the CSV file once the structure is created (a rough sketch of that conversion step follows).
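If you go that route, a rough sketch of the header-to-XLSX step inside the Lambda could look like this (openpyxl would have to be packaged with the function; bucket and key names are placeholders):

import csv
import io
import boto3
from openpyxl import Workbook

s3 = boto3.client("s3")

def header_to_xlsx(bucket, csv_key, xlsx_key):
    # Fetch only the first 64 KB of the CSV, enough to contain the header row
    first_bytes = s3.get_object(Bucket=bucket, Key=csv_key, Range="bytes=0-65535")["Body"].read()
    header = next(csv.reader(io.StringIO(first_bytes.decode("utf-8"))))

    wb = Workbook()
    wb.active.append(header)          # a single row holding the column names
    buf = io.BytesIO()
    wb.save(buf)
    s3.put_object(Bucket=bucket, Key=xlsx_key, Body=buf.getvalue())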
What are my other options here? Considerations include a nice repeatable design pattern to be used for future large CSVs or similar requirements, the cost of EFS, and whether I am making the best use of the tools available to me. Thanks!!!
Why not split the initial CSV file into multiple files and then process each file the same way you currently do?
Why are you converting the CSV into JSON at all? CSV can be loaded directly into a table without the data transformation that JSON specifically requires (the LATERAL FLATTEN to convert JSON into relational data rows). And why not use Snowflake's Snowpipe feature to load the data directly into Snowflake without Matillion? You can split large CSV files into smaller chunks before loading into Snowflake; this will help distribute the data processing load across Snowflake warehouses.
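On the splitting point, a small sketch of chunking a large CSV while keeping the header row in every chunk (chunk size and file naming are arbitrary):

import csv

def split_csv(src_path, rows_per_chunk=1_000_000, prefix="chunk"):
    # Writes chunk_000.csv, chunk_001.csv, ... each starting with the header row
    with open(src_path, newline="") as src:
        reader = csv.reader(src)
        header = next(reader)
        out, writer, chunk_index = None, None, 0
        for row_number, row in enumerate(reader):
            if row_number % rows_per_chunk == 0:
                if out:
                    out.close()
                out = open(f"{prefix}_{chunk_index:03d}.csv", "w", newline="")
                writer = csv.writer(out)
                writer.writerow(header)
                chunk_index += 1
            writer.writerow(row)
        if out:
            out.close()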
I also load CSV files from SFTP into Snowflake, using Matillion, with no idea of the schema.
In my process, I create a "temp" table in Snowflake, with 50 VARCHAR columns (Our files should never exceed 50 columns). Our data always contains text, dates or numbers, so VARCHAR isn't a problem. I can then load the .csv file into the temp table. I believe this should work for files coming from S3 as well.
That will at least get the data into Snowflake. How to create the "final" table however, given your scenario, I'm not sure.
I can imagine being able to use the header row, and/or doing some analysis on the 'type' of data contained in each column, to determine the column type needed.
But if you can get the 'final' table created, you could move the data over from temp. Or alter the temp table itself.
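One possible sketch of that "analyse the data to pick column types" step: read a sample of the file with pandas, map its dtypes to Snowflake types, and emit a CREATE TABLE statement. The dtype-to-Snowflake mapping and the names below are my own guesses, and anything unrecognised (including dates read as text) falls back to VARCHAR.

import pandas as pd

# crude pandas-dtype -> Snowflake-type mapping; extend as needed
TYPE_MAP = {"int64": "NUMBER", "float64": "FLOAT", "bool": "BOOLEAN", "object": "VARCHAR"}

def create_table_sql(csv_path, table_name, sample_rows=10_000):
    # Infer types from a sample so a multi-GB file isn't read in full
    sample = pd.read_csv(csv_path, nrows=sample_rows)
    columns = ", ".join(
        f'"{col}" {TYPE_MAP.get(str(dtype), "VARCHAR")}'
        for col, dtype in sample.dtypes.items()
    )
    return f"CREATE TABLE {table_name} ({columns})"

print(create_table_sql("vendor_file.csv", "final_table"))   # placeholder names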
This can be achieved using an external table, where the external table is mapped to a single column and the delimiter is a newline character. The external table also has a special virtual column that can be processed to extract all the columns dynamically, and a table can then be created based on the number of columns at any given time using a stored procedure. There is an interesting video that talks about this limitation in Snowflake (https://youtu.be/hRNu58E6Kmg)
I have multiple files in an S3 bucket folder. In Python I read the files one by one and used concat to build a single dataframe. However, it is pretty slow. If I have a million files it will be extremely slow. Is there any other method available (like bash) that can speed up reading the S3 files?
import boto3
import pandas as pd

client = boto3.client('s3')

def get_data(obj, col_name):
    # col_name is a dict mapping column index -> column name, defined elsewhere
    data = pd.read_csv(f's3://bucket/{obj.get("Key")}', delimiter='\t', header=None,
                       usecols=col_name.keys(), names=col_name.values(),
                       error_bad_lines=False)
    return data

response = client.list_objects_v2(
    Bucket='bucket',
    Prefix='key'
)

dflist = []
for obj in response.get('Contents', []):
    dflist.append(get_data(obj, col_name))

df = pd.concat(dflist)
Since S3 is object storage, you need to bring each file onto the computer doing the work (i.e. read the file into memory), edit it, and then push it back again (rewrite the object).
So the task will always take some time.
Some helpful pointers:
If you process multiple files in multiple threads, that will help you (see the sketch after these pointers).
If your data is really heavy, start an instance on AWS in the same region where your bucket is, process the data from there, and terminate the instance afterwards. (This will save the network cost plus the time to pull and push files across networks.)
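As a sketch of the multi-threaded option, reusing the client, col_name and get_data from the question above (the worker count is arbitrary):

from concurrent.futures import ThreadPoolExecutor

import pandas as pd

# For very large buckets, list_objects_v2 returns at most 1000 keys per call;
# use client.get_paginator('list_objects_v2') to walk all pages.
response = client.list_objects_v2(Bucket='bucket', Prefix='key')
objects = response.get('Contents', [])

with ThreadPoolExecutor(max_workers=16) as pool:
    frames = list(pool.map(lambda obj: get_data(obj, col_name), objects))

df = pd.concat(frames, ignore_index=True)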
You can use AWS SDK for Pandas (awswrangler), a library that extends pandas to work smoothly with AWS data stores. Its read_csv can also read multiple CSV files from an S3 folder.
import awswrangler as wr
df = wr.s3.read_csv("s3://bucket/folder/")
It can be installed via pip install awswrangler.