Read multiple files from S3 faster in Python

I have multiple files in an S3 bucket folder. In Python I read the files one by one and used concat to build a single dataframe. However, it is pretty slow. If I have a million files it will be extremely slow. Is there any other method (like bash) that can speed up reading the S3 files?
def get_data(obj, col_name):
    data = pd.read_csv(f's3://bucket/{obj.get("Key")}', delimiter='\t', header=None,
                       usecols=list(col_name.keys()), names=list(col_name.values()),
                       error_bad_lines=False)
    return data

response = client.list_objects_v2(
    Bucket='bucket',
    Prefix='key'
)

dflist = []
for obj in response.get('Contents', []):
    dflist.append(get_data(obj, col_name))

df = pd.concat(dflist)

Since S3 is object storage, you have to bring each file onto your machine (i.e. read it into memory), process it, and then push it back if needed (rewrite the object). So the task will always take some time.
Some helpful pointers:
Processing multiple files in multiple threads will help; see the sketch after these pointers.
If your data is really heavy, start an instance on AWS in the same region as your bucket, process the data from there, and terminate it afterwards. (This saves network cost plus the time to pull and push files across networks.)
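A rough sketch of the multithreaded approach, reusing the bucket name, prefix, and col_name mapping from the question (the worker count of 16 is just an assumption to tune):
from concurrent.futures import ThreadPoolExecutor

import boto3
import pandas as pd

client = boto3.client('s3')

def read_one(key, col_name):
    # Each thread reads one S3 object straight into a DataFrame
    return pd.read_csv(f's3://bucket/{key}', delimiter='\t', header=None,
                       usecols=list(col_name.keys()), names=list(col_name.values()))

keys = [obj['Key']
        for obj in client.list_objects_v2(Bucket='bucket', Prefix='key').get('Contents', [])]

with ThreadPoolExecutor(max_workers=16) as pool:
    dflist = list(pool.map(lambda key: read_one(key, col_name), keys))

df = pd.concat(dflist, ignore_index=True)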

You can use AWS SDK for pandas (awswrangler), a library that extends pandas to work smoothly with AWS data stores. Its read_csv can also read multiple CSV files from an S3 folder.
import awswrangler as wr
df = wr.s3.read_csv("s3://bucket/folder/")
It can be installed via pip install awswrangler.

Related

Get Metadata from S3 parquet file using Pyarrow

I have a parquet file in S3 to which I will be automatically appending additional data every week. The data has timestamps at 5-minute intervals. I do not want to append any duplicate data during my updates, so what I am trying to accomplish is to read ONLY the max (most recent) timestamp within the data saved in S3. Then I will make sure that all of the timestamps in the data I am appending are newer than that time before appending. I don't want to read the entire dataset from S3, in an effort to increase speed and preserve memory as the dataset continues to grow.
Here is an example of what I am doing now to read the entire file:
from pyarrow import fs
import pyarrow.parquet as pq
s3, path = fs.S3FileSystem(access_key, secret_key).from_uri(uri)
dataset = pq.ParquetDataset(path, filesystem=s3)
table = dataset.read()
But I am looking for something more like this (I am aware this isn't correct, but hopefully it conveys what I am attempting to accomplish):
max_date = pq.ParquetFile(path, filesystem=s3).metadata.row_group(0).column('timestamp').statistics['max']
I am pretty new to using both Pyarrow and AWS, so any help would be fantastic (including alternate solutions to my problem I described).
From a purely pedantic perspective I would phrase the problem statement a little differently as "I have a parquet dataset in S3 and will be appending new parquet files on a regular basis". I only mention that because the pyarrow documentation is written with that terminology in mind (e.g. you cannot append to a parquet file with pyarrow but you can append to a parquet dataset) and so it might help understanding.
The pyarrow datasets API doesn't have any operations to retrieve dataset statistics today (it might not be a bad idea to request the feature as a JIRA). However, it can help a little in finding your fragments. What you have doesn't seem that far off to me.
s3, path = fs.S3FileSystem(access_key, secret_key).from_uri(uri)

# At this point a call will be made to S3 to list all the files
# in the directory 'path'
dataset = pq.ParquetDataset(path, filesystem=s3)

max_timestamp = None
for fragment in dataset.get_fragments():
    field_index = fragment.physical_schema.get_field_index('timestamp')
    # This will issue a call to S3 to load the metadata
    metadata = fragment.metadata
    for row_group_index in range(metadata.num_row_groups):
        stats = metadata.row_group(row_group_index).column(field_index).statistics
        # Parquet files can be created without statistics
        if stats:
            row_group_max = stats.max
            if max_timestamp is None or row_group_max > max_timestamp:
                max_timestamp = row_group_max

print(f"The maximum timestamp was {max_timestamp}")
I've annotated the places where actual calls to S3 will be made. This will certainly be faster than loading all of the data but there is still going to be some overhead which will grow as you add more files. This overhead could get quite high if you are running outside of the AWS region. You could mitigate this by scanning the fragments in parallel but that will be extra work.
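A minimal sketch of that parallel scan, built on the same dataset object as above (the thread count of 8 is an assumption):
from concurrent.futures import ThreadPoolExecutor

def fragment_max(fragment):
    # One metadata call to S3 per fragment, done in a worker thread
    field_index = fragment.physical_schema.get_field_index('timestamp')
    metadata = fragment.metadata
    maxes = []
    for i in range(metadata.num_row_groups):
        stats = metadata.row_group(i).column(field_index).statistics
        if stats:
            maxes.append(stats.max)
    return max(maxes) if maxes else None

with ThreadPoolExecutor(max_workers=8) as pool:
    fragment_maxes = [m for m in pool.map(fragment_max, dataset.get_fragments()) if m is not None]

max_timestamp = max(fragment_maxes) if fragment_maxes else None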
It would be faster to store the max_timestamp in a dedicated statistics file whenever you update the data in your dataset. That way there is only ever one small file you need to read; a rough sketch follows below. If you're managing the writes yourself, you might look into a table format like Apache Iceberg, which is a standard format for storing this kind of extra information and statistics about a dataset (what Arrow calls a "dataset" Iceberg calls a "table").
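A rough illustration of that sidecar-file idea, reusing the pyarrow S3 filesystem and path from the snippet above (the _stats.json name is purely hypothetical; underscore-prefixed files are ignored during dataset discovery by default):
import json

stats_path = f"{path}/_stats.json"  # hypothetical sidecar object stored next to the data files

def write_stats(s3, max_timestamp):
    # Overwrite the sidecar after every append
    with s3.open_output_stream(stats_path) as out:
        out.write(json.dumps({"max_timestamp": str(max_timestamp)}).encode())

def read_stats(s3):
    # Reading one tiny object is far cheaper than touching every fragment
    with s3.open_input_stream(stats_path) as src:
        return json.loads(src.read().decode())["max_timestamp"]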

Logging a PySpark dataframe into a MLFlow Artifact

I am currently writing an MLFlow artifact to the dbfs but I am using pandas using the code below...
temp = tempfile.NamedTemporaryFile(prefix="*****", suffix=".csv")
temp_name = temp.name
try:
    df.to_csv(temp_name, index=False)
    mlflow.log_artifact(temp_name, "******")
finally:
    temp.close()  # Delete the temp file
How would I write this if 'df' was a spark dataframe?
You just need to use filepath URLs with the proper protocols. "dbfs" is the generic Databricks one. For Azure, "abfss" would be needed. (I cannot recall AWS's S3 extension.)
filepath="dbfs:///filepath"
df # My Spark DataFrame
df.write.csv(filepath)
mlflow.log_artifact(temp_name, filepath)
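As a hedged alternative sketch for Databricks specifically (assuming the standard /dbfs FUSE mount is available on the driver; the temp path and artifact name are hypothetical):
tmp_path = "dbfs:/tmp/my_spark_export"   # hypothetical scratch location on DBFS

# Spark writes a directory of part files to DBFS
df.write.mode("overwrite").csv(tmp_path)

# The same directory is visible to plain Python under /dbfs,
# so MLflow can log the whole directory as artifacts
mlflow.log_artifacts("/dbfs/tmp/my_spark_export", artifact_path="my_spark_export")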
It looks like in your case the problem has to do with how Spark APIs access the filesystem vs how Python APIs access it; see here for details. This is probably not the recommended way (I'm fairly new to Databricks myself), but if you're on a single node you can write your parquet to the local filesystem and MLflow can log it from there with something like:
with tempfile.TemporaryDirectory() as tmpdirname:
    df.write.parquet(f'file:{tmpdirname}/my_parquet_table')
    mlflow.log_artifacts(tmpdirname, artifact_path='my_parquet_table_name')
Keep in mind that a parquet "file" is actually a directory with a whole bunch of files in it, so you need to use log_artifacts, not log_artifact, and if you don't specify artifact_path you'll get all the little files that make up the parquet file (directory) dumped directly into the root of your MLflow artifacts. Also, MLflow doesn't have any previewing capability for parquet files, so depending on your use case, logging parquet artifacts may not be as convenient as it first seems.
HTH

Is it possible to do filebased processing with UDF in pyspark?

I have a UDF which does the following with a dataframe where a column contains the location of zip files in Azure Blob Storage (I tested the UDF without Spark and it worked):
downloads the given file from the blob and saves it somewhere on the executor/driver
extracts a certain file from the zip and saves it on the executor/driver
With this UDF I see the same speed as if I just looped over the files in plain Python. So is it even possible to do this kind of task in Spark? I wanted to use Spark to parallelize the download and unzipping to speed it up.
I connected via SSH to the executor and the driver (it is a test cluster, so it only has one of each) and found out that the data was processed only on the executor and the driver did not do anything at all. Why is that?
The next step would be to read the extracted files (normal CSVs) into a Spark dataframe. But how can this be done if the files are distributed over the executor and driver? I have not yet found a way to access the storage of the executors. Or is it somehow possible to define a common location within the UDF so the results are written back to a location on the driver?
I would then like to read the extracted files with:
data_frame = (
    spark
    .read
    .format('csv')
    .option('header', True)
    .option('delimiter', ',')
    .load(f"/mydriverpath/*.csv")
)
If there is another method to parallelize the download and unzipping of the files I would be happy to hear about it.
PySpark readers / writers make it easy to read and write files in parallel. When working in Spark, you generally should not loop over files or save data on the driver node.
Suppose you have 100 gzipped CSV files in the my-bucket/my-folder directory. Here's how you can read them into a DataFrame in parallel:
df = spark.read.csv("my-bucket/my-folder")
And here's how you can write them to 50 Snappy compressed Parquet files (in parallel):
df.repartition(50).write.parquet("my-bucket/another-folder")
The readers / writers do all the heavy lifting for you. See here for more info about repartition.

Writing .txt files to GCS from Spark/Dataproc: How to write only one large file instead of it automatically splitting in to multiple?

I use Dataproc to run a Pyspark script that writes a dataframe to text files in google cloud storage bucket. When I run the script with big data, I automatically end up with a large number of text files in my output folder, but I want only one large file.
I read here Spark saveAsTextFile() writes to multiple files instead of one I can use .repartition(1) before .write() to get one file but I want it to run fast (of course) so I don't want to go back to one partition before performing the .write().
df_plain = df.select('id', 'string_field1').write.mode('append').partitionBy('id').text('gs://evatest/output', compression="gzip")
Don't think of GCS as a filesystem. The content of a GCS bucket is a set of immutable blobs (files). Once written, they can't be changed. My recommendation is to let your job write all the files independently and aggregate them at the end. There are a number of ways to achieve this.
The easiest way to achieve this is through the gsutil compose command.
References:
How to concatenate sharded files on Google Cloud Storage automatically using Cloud Functions
compose - Concatenate a sequence of objects into a new composite object.
Google Cloud Storage Joining multiple csv files
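If you prefer to stay in Python rather than shelling out to gsutil, the google-cloud-storage client exposes the same compose operation. A minimal sketch (the bucket name is taken from the question; the shard and output object names are assumptions):
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("evatest")

# Hypothetical shard names written by the Spark job
shards = [bucket.blob(name) for name in ("output/part-00000", "output/part-00001")]

# Compose the shards into one new object (GCS accepts up to 32 sources per compose call)
combined = bucket.blob("output/combined.txt")
combined.compose(shards)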

Increase read from s3 performance of lambda code

I am reading a large JSON file from an S3 bucket. The Lambda gets called a few hundred times a second. When the concurrency is high, the Lambdas start timing out.
Is there a more efficient way of writing the code below, so that I do not have to download the file from S3 every time, or can reuse the content in memory across different instances of the Lambda? :-)
The contents of the file change only once a week!
I cannot split the file (due to the JSON structure) and it has to be read all at once.
s3 = boto3.resource('s3')
s3_bucket_name = get_parameter('/mys3bucketkey/')
bucket = s3.Bucket(s3_bucket_name)

try:
    bucket.download_file('myfile.json', '/tmp/' + 'myfile.json')
except:
    print("File to be read is missing.")

with open(r'/tmp/' + 'myfile.json') as file:
    data = json.load(file)
You probably aren't reaching the request rate limit (https://docs.aws.amazon.com/AmazonS3/latest/dev/optimizing-performance.html), but it is worth trying to copy the same S3 file under another prefix.
One possible solution is to avoid querying S3 altogether by putting the JSON file into the function code. Additionally, you may want to add it as a Lambda layer and load it from /opt in your Lambda: https://docs.aws.amazon.com/lambda/latest/dg/configuration-layers.html In this case you can automate the function update whenever the S3 file changes by adding another Lambda that is triggered by the S3 update and calls https://docs.aws.amazon.com/lambda/latest/dg/API_UpdateFunctionCode.html
As a long-term solution, check out Fargate (https://aws.amazon.com/fargate/getting-started/), with which you can build low-latency container-based services and put the file into a container.
When the Lambda function executes, it could check for the existence of the file in /tmp/ since the container might be re-used.
If it is not there, the function can download it.
If the file is already there, then there is no need to download it. Just use it!
However, you'll have to figure out how to handle the weekly update. Perhaps a change of filename based on date? Or check the timestamp on the file to see whether a new one is needed?
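A minimal sketch of that caching pattern, reusing the bucket lookup and file name from the question (the load_data helper is hypothetical and the weekly-refresh check is left out):
import json
import os

import boto3

s3 = boto3.resource('s3')
LOCAL_PATH = '/tmp/myfile.json'  # /tmp survives across invocations of a warm container

def load_data(bucket_name):
    # Download only if this warm container has not fetched the file yet
    if not os.path.exists(LOCAL_PATH):
        s3.Bucket(bucket_name).download_file('myfile.json', LOCAL_PATH)
    with open(LOCAL_PATH) as f:
        return json.load(f)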
