Issue while copying data from local to S3 to Redshift table - python

I have written a program which generates data in CSV format, then uploads that data to S3, which eventually gets copied to a Redshift table. Here is the code:
bucket2 = self.s3Conn.lookup('my-bucket')
k = Key(bucket2)

## Delete existing key
key_del = bucket2.delete_key("test_file.csv")

## Create new key and upload file to S3
k.key = "test_file.csv"
k.set_contents_from_filename('test_file.csv')

## Move file from S3 to Redshift
logging.info("\nFile Uploaded to S3 bucket\n")
try:
    self.newCur.execute("TRUNCATE test_file")
    self.newCur.execute("COPY test_file FROM 's3://my-bucket/test_file.csv' credentials 'aws_access_key_id=xxxxxx;aws_secret_access_key=xxxxxx' DELIMITER ',';")
except psycopg2.DatabaseError as e:
    logging.exception("Database exception")
The file has around 13,500 lines with 10 columns.
I verified that the Redshift table has the same number of columns and matching data types.
But still, every time it breaks at line 13204 with the error "Delimiter not found" in the "stl_load_errors" table. The data in row 13204 doesn't matter, as I updated that row with other values as well.
So I checked the S3 bucket and downloaded the file that was copied there. What I see is that the file was not copied entirely; it usually cuts off around 811007 characters.
Earlier I have uploaded larger files to S3 without any issue.
Any idea why this is happening?

Thanks for the help. The issue was pretty simple.
I was writing the file on my local disk using file.write() and then copying it to S3.
So before copying to S3, I needed to CLOSE the file using file.close(), which I did not do.
Yes, that's silly :)
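Because the file was never closed, the tail of the write buffer never made it to disk before the upload. For illustration, a minimal sketch (the row source rows is a stand-in for whatever generates the data) that sidesteps the problem by letting a with block close and flush the file first:

# writing inside a with block guarantees the file is flushed and closed
# before boto reads it for the upload
with open('test_file.csv', 'w') as out:
    for row in rows:
        out.write(','.join(str(v) for v in row) + '\n')

# only now is it safe to upload
k.set_contents_from_filename('test_file.csv')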

Look closely at whether there are NULL bytes (0x00) in row 13204. I have seen them in the middle of fields, where they cause various kinds of loading errors. You can either add the NULL AS '\000' option to the COPY command to bypass them, or use a hex editor to inspect the file; note that a normal editor may not show that a null byte is there.
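If you would rather check programmatically, a minimal sketch (assuming the generated file is still available locally as test_file.csv):

# scan the CSV for NULL bytes and report the lines that contain them
with open('test_file.csv', 'rb') as f:
    for lineno, line in enumerate(f, start=1):
        if b'\x00' in line:
            print('NULL byte found on line', lineno)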

I take a similar approach in my Redshift CSV upload script.
You can use it as a sanity check or to draw a performance baseline for the script you are working on.
Try CSV_Loader_For_Redshift.
The script will:
Compress and upload your file to S3
Append your data to the Redshift table.
Sample output for a 12 MB / 50k-line file:
S3 | data.csv.gz | 100%
Redshift | test2 | DONE
Time elapsed: 5.7 seconds

Related

Parsing and displaying large json files in SageMaker

I have gzipped json files in S3 and I'm trying to show them in a SageMaker Studio Notebook like so:
import boto3
import gzip

s3 = boto3.resource("s3")
s3_object = s3.Object("bucket", "path")
with gzip.GzipFile(fileobj=s3_object.get()["Body"]) as gzip_file:
    print("reading s3 object through gzip stream")
    raw_json = gzip_file.read()
    print("done reading s3 object, about to flush")
    gzip_file.flush()
    print("done flushing")
print("about to print")
print(raw_json.decode("utf-8"))
print("done printing")
My constraint is that it has to be done in memory; I've resorted to running on an ml.m5.2xlarge instance, which should be more than enough.
I know of IPython.display.JSON, pandas.read_json and json.load()/json.loads(); I'm treating the content as a plain string to keep the question simple.
The (unexpected) output of the above code is:
reading s3 object through gzip stream
done reading s3 object, about to flush
done flushing
about to print
At this point the kernel status is 'Busy', and it can remain like that for a good few minutes until it finally seems to just give up, with no output.
If I run the exact same code in a notebook running on my laptop, it works just fine, quickly showing the JSON content.
What am I missing? Is there a better way to do this?
My end goal is to sometimes present the data as a pandas DataFrame and sometimes show it as an IPython.display.JSON for conveniently viewing the content.
This is a known issue with Studio. How big is your JSON file?
I would recommend converting it to a pandas DataFrame to view it in the console.
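For example, a minimal sketch of that approach (assuming the object is gzipped JSON Lines; the bucket and key names are placeholders):

import io
import gzip

import boto3
import pandas as pd

s3 = boto3.resource("s3")
obj = s3.Object("bucket", "path")

# decompress the streamed body, then let pandas build the DataFrame
with gzip.GzipFile(fileobj=obj.get()["Body"]) as gz:
    df = pd.read_json(io.BytesIO(gz.read()), lines=True)

df.head()  # renders as a table instead of one huge string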

Azure Blobstore: How can I read a file without having to download the whole thing first?

I'm trying to figure out how to read a file from Azure blob storage.
Studying its documentation, I can see that the download_blob method seems to be the main way to access a blob.
This method, though, seems to require downloading the whole blob into a file or some other stream.
Is it possible to read a file from Azure Blob Storage line by line, as a stream from the service, without having to download the whole thing first?
Update 0710:
In the latest SDK, azure-storage-blob 12.3.2, we can also do the same thing by using download_blob.
The signature of download_blob takes offset and length parameters, so just provide them, like below (it works as per my test):
blob_client.download_blob(60,100)
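A slightly fuller sketch of that idea (assuming azure-storage-blob 12.x; the blob URL, container, blob name and chunk size are placeholders):

from azure.storage.blob import BlobClient

# placeholder URL; a SAS token or account credential is needed in practice
blob_client = BlobClient.from_blob_url(
    "https://<account>.blob.core.windows.net/<container>/<blob>?<sas-token>")

# read the blob in fixed-size ranges instead of downloading it all at once
total = blob_client.get_blob_properties().size
offset, chunk_size = 0, 1024 * 1024
while offset < total:
    chunk = blob_client.download_blob(offset=offset, length=chunk_size).readall()
    # process chunk here (e.g. split it on newlines yourself)
    offset += len(chunk)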
Original answer:
You cannot read the blob file line by line, but you can read it in byte ranges: for example, read the first 10 bytes, then the next 10 to 20 bytes, and so on.
The range methods below are only available in the older version of the Python blob storage SDK, 2.1.0. Install it like below:
pip install azure-storage-blob==2.1.0
Here is the sample code (here I read text, but you can switch to the get_blob_to_stream(container_name, blob_name, start_range=0, end_range=10) method to read a stream):
from azure.storage.blob import BlockBlobService

accountname = "xxxx"
accountkey = "xxxx"
blob_service_client = BlockBlobService(account_name=accountname, account_key=accountkey)
container_name = "test2"
blob_name = "a5.txt"

# get the length of the blob file; you can use it if you need a loop in your code to read the blob file
blob_property = blob_service_client.get_blob_properties(container_name, blob_name)
print("the length of the blob is: " + str(blob_property.properties.content_length) + " bytes")
print("**********")

# get the first 10 bytes of data
b1 = blob_service_client.get_blob_to_text(container_name, blob_name, start_range=0, end_range=10)
# you can use the method below to read a stream instead
# blob_service_client.get_blob_to_stream(container_name, blob_name, start_range=0, end_range=10)
print(b1.content)
print("*******")

# get the next range of data
b2 = blob_service_client.get_blob_to_text(container_name, blob_name, start_range=10, end_range=50)
print(b2.content)
print("********")

# get the next range of data
b3 = blob_service_client.get_blob_to_text(container_name, blob_name, start_range=50, end_range=200)
print(b3.content)

writing to pysftp fileobject using pandas to_csv with compression doesn't actually compress

I have looked at many related answers here on Stack Overflow, and this question seems most related: How to Transfer Pandas DataFrame to .csv on SFTP using Paramiko Library in Python?. I want to do something similar; however, I want to compress the file when I send it to the SFTP location, so I end up with a .csv.gz file essentially. The files I am working with are 15-40 MB uncompressed, but there are sometimes lots of them, so I need to keep the footprint small.
I have been using code like this to move the dataframe to the destination, after pulling it from another location as a CSV and doing some transformations on the data itself:
fileList = source_sftp.listdir('/Inbox/')
dataList = []
for item in fileList:  # for each file in the list...
    print(item)
    if item[-3:] == u'csv':
        temp = pd.read_csv(source_sftp.open('/Inbox/'+item))  # read the csv directly from the sftp server into a pd Dataframe
    elif item[-3:] == u'zip':
        temp = pd.read_csv(source_sftp.open('/Inbox/'+item), compression='zip')
    elif item[-3:] == u'.gz':
        temp = pd.read_csv(source_sftp.open('/Inbox/'+item), compression='gzip')
    else:
        temp = pd.read_csv(source_sftp.open('/Inbox/'+item), compression='infer')
    dataList.append(temp)  # keep each

# ... Some transformations in here on the data

FL = [(x.replace('.csv', '')) + suffix  # just swap out to suffix
      for x in fileList]
locpath = '{}/some/new/dir/'.format(dest_sftp.pwd)
i = 0
for item in dataList:
    with dest_sftp.open(locpath + FL[i], 'w') as f:
        item.to_csv(f, index=False, compression='gzip')
    i = i + 1
It seems like I should be able to get this to work, but I am guessing there is something being skipped over when I use to_csv to convert the dataframe back and then compress it on the SFTP file object. Should I be streaming this somehow, or is there a solution I am missing somewhere in the documentation on pysftp or pandas?
If I can avoid saving the CSV file somewhere local first, I would like to, but I don't think I should have to, right? I am able to get the file compressed in the end if I just save it locally with temp.to_csv('/local/path/myfile.csv.gz', compression='gzip'), and after transferring this local file to the destination it is still compressed, so I don't think it has to do with the transfer, just with how pandas.DataFrame.to_csv and pysftp.Connection.open are used together.
I should probably add that I still consider myself a newbie to much of Python, but I have been working with local-to-SFTP and SFTP-to-local transfers and have not had to do much in the way of transferring (directly or indirectly) between them.
Make sure you have the latest version of pandas.
It has supported compression with a file-like object only since 0.24:
GH21227: df.to_csv ignores compression when provided with a file handle
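If upgrading is not an option, a minimal sketch of a version-independent workaround (reusing the names from the question's second loop) is to compress explicitly with gzip.GzipFile instead of relying on to_csv:

import gzip

# render the CSV in memory, then gzip it onto the SFTP file object ourselves
csv_bytes = item.to_csv(index=False).encode('utf-8')
with dest_sftp.open(locpath + FL[i], 'wb') as f:
    # wrap the SFTP file object so the remote file really is a .csv.gz
    with gzip.GzipFile(fileobj=f, mode='wb') as gz:
        gz.write(csv_bytes)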

Spark df partitioning after partitioning by yy/mm/dd

S3 hosts a very large compressed file (20 GB compressed -> 200 GB uncompressed).
I want to read this file in (unfortunately it decompresses on a single core), transform some SQL columns, and then output to S3 in the s3_path/year=2020/month=01/day=01/[files 1-200].parquet format.
The entirety of the file is comprised of data from the same date. This leads me to believe that instead of using partitionBy('year','month','day') I should append "year={year}/month={month}/day={day}/" to the S3 path, because currently Spark is writing a single file at a time to S3 (1 GB each). Is my thinking correct?
Here is what I'm doing currently:
df = df\
    .withColumn('year', lit(datetime_object.year))\
    .withColumn('month', lit(datetime_object.month))\
    .withColumn('day', lit(datetime_object.day))

df\
    .write\
    .partitionBy('year', 'month', 'day')\
    .parquet(s3_dest_path, mode='overwrite')
What I'm thinking:
df = spark.read.format('json')\
    .load(s3_file, schema=StructType.fromJson(my_schema))\
    .repartition(200)
# currently takes a long time decompressing the 20gb s3_file.json.gz

# transform

df.write\
    .parquet(s3_dest_path + 'year={}/month={}/day={}/'.format(year, month, day))
You're probably running into the problem that Spark first writes the data to a _temporary directory and only then commits it to the final location. In HDFS this commit is done with a rename. However, S3 does not support renames, so the data is copied in full instead (using only one executor). For more on this topic see, for example, this post: Extremely slow S3 write times from EMR/Spark.
A common workaround is to write to HDFS and then use distcp to copy the data from HDFS to S3 in a distributed fashion.
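A minimal sketch of that workaround, continuing the question's code (the HDFS staging path and the bucket name are placeholders):

# 1) write the partitioned output to HDFS first, where the commit rename is cheap
df.write\
    .partitionBy('year', 'month', 'day')\
    .parquet('hdfs:///tmp/staging/s3_dest', mode='overwrite')

# 2) then copy the finished result to S3 in a distributed way, e.g. from the shell:
#    hadoop distcp hdfs:///tmp/staging/s3_dest s3a://my-bucket/s3_dest_path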

JSON format validation without loading file

How can I validate JSON format without loading the file? I am copying files from one S3 bucket to another S3 bucket. After the JSONL files are copied, I want to check that the file format is correct, in the sense that the curly braces and commas are fine.
I don't want to use json.load() because the files are large and numerous and it would slow down the process; besides, the file is already copied, so there is no need to parse it, only to validate it.
There is no capability within Amazon S3 itself to validate the content of objects.
You could configure S3 to trigger an AWS Lambda function whenever a file is created in the S3 bucket. The Lambda function could then parse the file and perform some action (eg send a notification or move the object to another location) if the validation fails.
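For illustration, a minimal sketch of such a Lambda function (the handler and the response shape are assumptions; it streams the new object and checks each JSONL line individually):

import json

import boto3

s3 = boto3.client('s3')

def lambda_handler(event, context):
    record = event['Records'][0]['s3']
    bucket = record['bucket']['name']
    key = record['object']['key']
    body = s3.get_object(Bucket=bucket, Key=key)['Body']
    # validate line by line instead of holding the whole object in memory
    for lineno, line in enumerate(body.iter_lines(), start=1):
        if not line:
            continue
        try:
            json.loads(line)
        except ValueError:
            # validation failed: send a notification, move the object, etc.
            return {'valid': False, 'line': lineno}
    return {'valid': True}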
Streaming the file seems to be the way to go about it: put it into a generator and yield line by line to check whether the JSON is valid. The requests library supports streaming a file.
The solution would look something like this:
import requests

def get_data():
    # stream the file (e.g. via a pre-signed S3 URL) instead of downloading it whole
    r = requests.get('s3_file_url', stream=True)
    yield from r.iter_lines()

def parse_data():
    # initialize the generator
    gd_gen = get_data()
    while True:
        try:
            line = gd_gen.__next__()
        except StopIteration:
            break
        # put your validation code here
Let me know if you need further clarification.
