Spark df partitioniong after partioning by yy/mm/dd

Spark df partitioniong after partioning by yy/mm/dd - python

S3 hosts a very large compressed file (20gb compressed -> 200gb uncompressed).
I want to read this file in (unfortunately decompress on single core), transform some sql columns, and then output to S3 with the s3_path/year=2020/month=01/day=01/[files 1-200].parquet format.
The entirety of the file will comprised of data from the same date. This leads me to believe instead of using partitionBy('year','month','day') I should append "year={year}/month={month}/day={day}/" to the s3 path, because currently spark is writing a single file at a time to s3 (1gb size each). Is my thinking correct?
Here is what I'm doing currently:
df = df\
.withColumn('year', lit(datetime_object.year))\
.withColumn('month', lit(datetime_object.month))\
.withColumn('day', lit(datetime_object.day))
df\
.write\
.partitionBy('year','month','day')\
.parquet(s3_dest_path, mode='overwrite')
What I'm thinking:
df = spark.read.format('json')\
.load(s3_file, schema=StructType.fromJson(my_schema))\
.repartition(200)
# currently takes a long time decompressing the 20gb s3_file.json.gz
# transform
df.write\
.parquet(s3_dest_path + 'year={}/month={}/day={}/'.format(year,month,day))

You're probably running into the problem that spark writes data first to some _temporary directory and only then commit it to the final location. In HDFS this is done by rename. However S3 does not support renames, but instead copies the data fully (only using one executor). For more on this topic see for example this post: Extremely slow S3 write times from EMR/ Spark
Common work-around is to write to hdfs and then use distcp to copy distributed from hdfs to s3

Related

Write pyspark binary column to S3 PDF files (AWS glue job)

I have a pyspark.sql.dataframe sourcing some parquet-files which contains a column with the dataformat binary, it holds one PDF-file per row. Currently, i can write them locally by calling write_documents:
# full_path includes name of file and its suffix (.pdf)
def write_document_locally(full_path: str, byte_file: bytearray):
with open(full_path, "wb") as f:
f.write(byte_file)
def write_documents(data_frame: sql.DataFrame) -> None:
[
write_document_locally(full_path=full_path, byte_file=byte_file)
for full_path, byte_file in zip(
data_frame["file_path_and_name"], data_frame["byte_file"]
)
]
From the same job I'm also writing a parquet-table to a separate location. Both folders that are created including the resulting PDF/parquet-files are partitioned by year and id. In the PDF-case i partition by manually concatenating year=XXXX/id=XX to the full_path, in the parquet-case i use:
data_frame.write.mode("overwrite").partitionBy("year", "id").parquet(path=another_path)
To replicate the PDF-export in AWS and writing it to a S3-bucket instead, i would have to use boto3. I'm wondering whether there is a more efficient way of doing this using data_frame.write instead.
The problems with using boto3 is 1) I will write the pdf locally in one driver before uploading it to S3 which is inefficient and gathers all data in one driver (i think), 2) it would not create partitions automatically for me.

How can I import many binary files in Dask?

I have many binary files (.tdms format, similar to .wav) stored in S3 and I would like to read them with nptdms then process them in a distributed fashion with Dask on a cluster.
In PySpark there is pyspark.SparkContext.binaryFiles() which produces an RDD with a bytearray for each input file which is a simple solution to this problem.
I have not found an equivalent function in Dask - is there one? If not, how could the equivalent functionality be achieved in Dask?
I noticed there's dask.bytes.read_bytes() if it's necessary to involve this however nptdms can't read a chunk of a file - it needs the entire file to be available and I'm not sure how to accomplish that.

dask.bytes.read_bytes() will give you whole files if you use blocksize=None, i.e., exactly one block per file. The most common use case for that is compressed files (e.g., gzip) where you can't start mid-stream, but should work for your use case too. Note that the delayed objects you get each return bytes, not open files.
Alternatively, you can use fsspec.open_files. This returns OpenFile objects, which are safe to serialise and so you can use them in dask.delayed calls such as
ofs = fsspec.open_files("s3://...", ...)
#dask.delayed
def read_a_file(of):
with of as f:
# entering context actually touches storage
return TdmsFile.read(f)
tdms = [read_a_file(of) for of in ofs]

AWS Lambda: read csv file dimensions from an s3 bucket with Python without using Pandas or CSV package

good afternoon. I am hoping that someone can help me with this issue.
I have multiple CSV files that are sitting in an s3 folder. I would like to use python without the Pandas, and the csv package (because aws lambda has very limited packages available, and there is a size restriction) and loop through the files sitting in the s3 bucket, and read the csv dimensions (length of rows, and length of columns)
For example my s3 folder contains two csv files (1.csv, and 2 .csv)
my code will run through the specified s3 folder, and put the count of rows, and columns in 1 csv, and 2 csv, and puts the result in a new csv file. I greatly appreciate your help! I can do this using the Pandas package (thank god for Pandas, but aws lambda has restrictions that limits me on what I can use)
AWS lambda uses python 3.7

If you can visit your s3 resources in your lambda function, then basically do this to check the rows,
def lambda_handler(event, context):
import boto3 as bt3
s3 = bt3.client('s3')
csv1_data = s3.get_object(Bucket='the_s3_bucket', Key='1.csv')
csv2_data = s3.get_object(Bucket='the_s3_bucket', Key='2.csv')
contents_1 = csv1_data['Body'].read()
contents_2 = csv2_data['Body'].read()
rows1 = contents_1.split()
rows2=contents_2.split()
return len(rows1), len(rows2)
It should work directly, if not, please let me know. BTW, hard coding the bucket and file name into the function like what I did in the sample is not a good idea at all.
Regards.

Hitting AWS Lambda Memory limit in Python

I am looking for some advice on this project. My thought was to use Python and a Lambda to aggregate the data and respond to the website. The main parameters are date ranges and can be dynamic.
Project Requirements:
Read data from monthly return files stored in JSON (each file contains roughly 3000 securities and is 1.6 MB in size)
Aggregate the data into various buckets displaying counts and returns for each bucket (for our purposes here lets say the buckets are Sectors and Market Cap ranges which can vary)
Display aggregated data on a website
Issue I face
I have successfully implemted this in an AWS Lambda, however in testing requests that are 20 years of data (and yes I get them), I begin to hit the memory limits in AWS Lambda.
Process I used:
All files are stored in S3, so I use the boto3 library to obtain the files, reading them into memory. This is still small and not of any real significance.
I use json.loads to convert the files into a pandas dataframe. I was loading all of the files into one large dataframe. - This is where the it runs out of memory.
I then pass the dataframe to custom aggregations using groupby to get my results. This part is not as fast as I would like but does the job of getting what I need.
The end result dataframe that is then converted back into JSON and is less than 500 MB.
This entire process when it works locally outside the lambda is about 40 seconds.
I have tried running this with threads and processing single frames at once but the performance degrades to about 1 min 30 seconds.
While I would rather not scrap everything and start over, I am willing to do so if there is a more efficient way to handle this. The old process did everything inside of node.js without the use of a lambda and took almost 3 minutes to generate.
Code currently used
I had to clean this a little to pull out some items but here is the code used.
Read data from S3 into JSON this will result in a list of string data.
while not q.empty():
fkey = q.get()
try:
obj = self.s3.Object(bucket_name=bucket,key=fkey[1])
json_data = obj.get()['Body'].read().decode('utf-8')
results[fkey[1]] = json_data
except Exception as e:
results[fkey[1]] = str(e)
q.task_done()
Loop through the JSON files to build a dataframe for working
for k,v in s3Data.items():
lstdf.append(buildDataframefromJson(k,v))
def buildDataframefromJson(key, json_data):
tmpdf = pd.DataFrame(columns=['ticker','totalReturn','isExcluded','marketCapStartUsd',
'category','marketCapBand','peGreaterThanMarket', 'Month','epsUsd']
)
#Read the json into a dataframe
tmpdf = pd.read_json(json_data,
dtype={
'ticker':str,
'totalReturn':np.float32,
'isExcluded':np.bool,
'marketCapStartUsd':np.float32,
'category':str,
'marketCapBand':str,
'peGreaterThanMarket':np.bool,
'epsUsd':np.float32
})[['ticker','totalReturn','isExcluded','marketCapStartUsd','category',
'marketCapBand','peGreaterThanMarket','epsUsd']]
dtTmp = datetime.strptime(key.split('/')[3], "%m-%Y")
dtTmp = datetime.strptime(str(dtTmp.year) + '-'+ str(dtTmp.month),'%Y-%m')
tmpdf.insert(0,'Month',dtTmp, allow_duplicates=True)
return tmpdf

Issue while copying data from local to S3 to Redshift table

I have written a program which generates data in csv format, then uploads that data to S3 which eventually gets copies to Redshift table. Here is the code
bucket2 = self.s3Conn.lookup('my-bucket')
k = Key(bucket2)
## Delete existing
key_del = bucket2.delete_key("test_file.csv")
## Create new key and upload file to s3
k.Key = "test_file.csv"
k.name = "test_file.csv"
k.set_contents_from_filename('test_file.csv')
## Move file from S3 to redshift
logging.info("\nFile Uploaded to S3 bucket\n")
try:
self.newCur.execute("Truncate test_file")
self.newCur.execute("COPY test_file FROM 's3://my-bucket/test_file.csv' credentials 'aws_access_key_id=xxxxxx;aws_secret_access_key=xxxxxx DELIMITER ','; ")
except psycopg2.DatabaseError, e:
logging.exception("Database exception ")
File has around 13500 lines with 10 columns.
I verified that redhshift has same number of columns and data type
But still, everytime it breaks after 13204 line with error in "stl_load_errors" table as "Delimited not found". Data in row 13204 doesnt matter as I updated that row also with other values.
So I check S3 bucket to check my csv file. I downloaded file which was copied to S3 bucket. What I see is that file is not copied entirely. It usually breaks around 811007 characters.
Earlier I have uploaded larger files into S3 without any issue.
Any idea as why is this happening?

Thanks for the help. The issue was pretty simple.
I was writing the file on my local disk using file.write() and then copying it to S3.
So before copying to S3, I needed to CLOSE the file using file.close(), which I did not do.
Yes, that's silly :)

You should look closer if there are NULL bytes 0x00 at row 13204. I have seen those in the middle of fields which cause different kinds of loading errors. To check, you can either use NULL AS '\000' option to bypass them or use a hex editor to read the file. Note that a normal editor might not show there's a null byte.

I take similar approach in my Redshift CSV upload script.
You can use it to do "sanity check" or draw performance baseline for the script you are working on.
Try CSV_Loader_For_Redshift.
Script will:
Compress and upload your file to S3
Append your data to Redshift table.
Sample output for 12Mb/50k line file:
S3 | data.csv.gz | 100%
Redshift | test2 | DONE
Time elapsed: 5.7 seconds

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Spark df partitioniong after partioning by yy/mm/dd - python

Related

Write pyspark binary column to S3 PDF files (AWS glue job)

How can I import many binary files in Dask?

AWS Lambda: read csv file dimensions from an s3 bucket with Python without using Pandas or CSV package

Hitting AWS Lambda Memory limit in Python

Issue while copying data from local to S3 to Redshift table

Categories

Resources