convert docx to pdf using pypandoc with BytesIO file path - python

I want to get a docx file from Azure Blob Storage, convert it to PDF, and save it back to Azure Blob Storage. I want to use pypandoc for the docx-to-pdf conversion:
pypandoc.convert_file('abc.docx', format='docx', to='pdf', outputfile='abc.pdf')
But I want to run this code in an Azure Function, where I won't have enough disk space to save files, so I am downloading the file from Azure Blob Storage into a BytesIO stream as follows:
from io import BytesIO
from azure.storage.blob import BlobServiceClient

blob_service_client = BlobServiceClient.from_connection_string(cs)
container_client = blob_service_client.get_container_client(container_name)
blob_client = container_client.get_blob_client(filename)

# Download the blob into an in-memory stream
streamdownloader = blob_client.download_blob()
stream = BytesIO()
streamdownloader.download_to_stream(stream)
Now I want to convert the docx file, which is accessible through this stream, to PDF. The converted PDF should also end up in a BytesIO stream so I can upload it to blob storage without writing anything to disk. But pypandoc raises the error: RuntimeError: source_file is not a valid path.
If you can suggest some other way to convert docx to pdf that can handle BytesIO input, please do. I should mention that I will be working in a Linux environment, where libraries like doc2pdf are not supported.
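For reference, a minimal sketch of the usual temp-file workaround, since pypandoc only accepts real paths. It assumes the function host exposes a small writable temp directory (e.g. /tmp), that a PDF engine such as pdflatex or wkhtmltopdf is installed alongside pandoc, and the output blob name 'abc.pdf' is a placeholder:

import os
import tempfile
from io import BytesIO
import pypandoc

def docx_stream_to_pdf_bytes(docx_stream: BytesIO) -> bytes:
    # pypandoc needs file paths, so spill the in-memory docx to a temp file
    with tempfile.NamedTemporaryFile(suffix='.docx', delete=False) as tmp_in:
        tmp_in.write(docx_stream.getvalue())
        in_path = tmp_in.name
    out_path = in_path.replace('.docx', '.pdf')
    try:
        # PDF output requires an installed PDF engine (e.g. pdflatex or wkhtmltopdf)
        pypandoc.convert_file(in_path, 'pdf', format='docx', outputfile=out_path)
        with open(out_path, 'rb') as f:
            return f.read()
    finally:
        for p in (in_path, out_path):
            if os.path.exists(p):
                os.remove(p)

# Upload the result back to blob storage without keeping it on disk
pdf_bytes = docx_stream_to_pdf_bytes(stream)
container_client.upload_blob(name='abc.pdf', data=pdf_bytes, overwrite=True)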

Related

Read file from S3 into Python memory

I have a large csv file stored in S3. I would like to download, edit, and re-upload this file without it ever touching my hard drive, i.e. read it straight into memory from S3. I am using the python library boto3; is this possible?
You should look into the io module
Depending on how you want to read the file, you can create a StringIO() or BytesIO() object and download your file to this stream.
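A minimal sketch with boto3 and BytesIO (the bucket and key names here are placeholders):

import boto3
from io import BytesIO

s3 = boto3.client('s3')

# Download the object straight into memory
buf = BytesIO()
s3.download_fileobj('my-bucket', 'my-file.csv', buf)

# ... edit the contents in memory ...

# Upload the edited bytes back without touching disk
buf.seek(0)
s3.upload_fileobj(buf, 'my-bucket', 'my-file.csv')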
You should check out these answers:
How to read image file from S3 bucket directly into memory?
How to read a csv file from an s3 bucket using Pandas in Python

Convert s3 video files resolution and store it back to s3

I want to convert video files in an s3 bucket to other resolutions and store them back to s3.
I know we can use ffmpeg locally to change the resolution.
Can we use ffmpeg to convert s3 files and store them back using Django?
I am puzzled about where to start and what the process should be.
Should I buffer the video file from the s3 bucket, convert it with ffmpeg, and then upload it back to s3?
Or is there any way to do it directly without having to keep it in a buffer?
You can break the problem into 3 parts:
Get the file from s3: use boto3 & the AWS APIs to download the file
Convert the file locally: ffmpeg has got you covered here. It'll generate a new file
Upload the file back to s3: use boto3 again
This is fairly simple and robust for a script; a sketch of these three steps follows.
Now if you want to optimize, you can try using something like s3fs to mount your s3 bucket locally and do the conversion there.
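A minimal sketch of the download / convert / upload flow (bucket names, keys, and the target resolution are placeholders, and it assumes ffmpeg is on the PATH):

import subprocess
import boto3

s3 = boto3.client('s3')

# 1. Get the file from s3
s3.download_file('my-bucket', 'input.mp4', '/tmp/input.mp4')

# 2. Convert locally with ffmpeg (here: rescale to 1280x720)
subprocess.run(
    ['ffmpeg', '-i', '/tmp/input.mp4', '-vf', 'scale=1280:720', '/tmp/output_720p.mp4'],
    check=True,
)

# 3. Upload the converted file back to s3
s3.upload_file('/tmp/output_720p.mp4', 'my-bucket', 'output_720p.mp4')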

Azure storage turn off automatic decompression

Hi, I'm using python with the azure-sdk to download files from a storage blob. The following code is what I use:
from azure.storage.blob import BlockBlobService

BLOB_SERVICE = BlockBlobService(account_name=AZURE_BLOB_SERVICE_ACCOUNT_NAME, account_key=AZURE_BLOB_SERVICE_ACCOUNT_KEY)
cloud_globals.BLOB_SERVICE.get_blob_to_path(
    guid,
    name,
    path,
)
The download works, but Azure or the SDK decompresses my gzipped files on the fly when they are fetched. I need the files to stay gzipped, and I would prefer to download them exactly as they are in storage. Is there any way to turn off this behaviour?
In my experience, your issue is due to the blob's properties. You can check them in the portal; you need to set Content-Encoding to null.
I tested your code and the gz file downloaded normally.
If I set Content-Encoding = gzip, which corresponds to my file format, the gz file is decompressed when it is fetched, the same as you describe. You can refer to this doc.
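If you need to clear the property from code rather than the portal, here is a minimal sketch with the legacy SDK the question uses; be aware that passing a new ContentSettings object replaces the blob's existing content settings, so this is a sketch rather than a drop-in fix:

from azure.storage.blob import BlockBlobService, ContentSettings

blob_service = BlockBlobService(account_name=AZURE_BLOB_SERVICE_ACCOUNT_NAME,
                                account_key=AZURE_BLOB_SERVICE_ACCOUNT_KEY)

# Clear Content-Encoding so the SDK no longer decompresses on download
blob_service.set_blob_properties(
    'mycontainer',
    'myblob.gz',
    content_settings=ContentSettings(content_encoding=None),
)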

Python script to use data from Azure Storage Blob by stream, and update blob by stream without local file reading and uploading

I have python code for data processing, and I want to use an azure block blob as the data input for the code; to be specific, a csv file from a block blob. Downloading the csv file from azure blob to a local path, and uploading it back the other way, works fine when the code runs locally. But the problem is that my code runs on an azure virtual machine, because it is quite heavy for my Apple Air, and pandas read_csv from a local path does not work in this case. Therefore I have to download, upload, and update csv files to azure storage by stream, without saving locally. Both the downloaded and uploaded csv files are quite small, much less than the block blob limits.
There isn't much of a tutorial explaining how to do this step by step, and the MS docs are generally not great at explaining it either. My minimal code is as follows.
For downloading from azure blob storage:
from azure.storage.blob import BlockBlobService
import pandas as pd

storage = BlockBlobService(account_name='myname', account_key='mykey')
# here I don't know how to make a csv stream that could be used in the next steps
file = storage.get_blob_to_stream('accountname', 'blobname', 'stream')
df = pd.read_csv(file)
# df for later steps
For uploading and updating a blob by stream with a dataframe generated by the code:
df  # dataframe generated by the code
# I don't know how to do the preparation steps for df and the final upload operation
storage.put_blob_to_list_by_stream('accountname', 'blobname', 'stream')
Can you please make a step-by-step tutorial for me? For people with experience with azure blob, this should not be very difficult.
Or if you have a better solution than using a blob for my case, please drop some hints. Thanks.
So the documentation is still in progress; I think it is getting better and better...
Useful link:
Github - Microsoft Azure Storage SDK for Python
Quickstart: Upload, download, and list blobs using Python
To download a file as a stream from blob storage, you can use BytesIO:
from azure.storage.blob import BlockBlobService
from io import BytesIO
from shutil import copyfileobj

with BytesIO() as input_blob:
    with BytesIO() as output_blob:
        block_blob_service = BlockBlobService(account_name='my_account_name', account_key='my_account_key')

        # Download as a stream
        block_blob_service.get_blob_to_stream('mycontainer', 'myinputfilename', input_blob)
        input_blob.seek(0)  # rewind before reading from the stream

        # Do whatever you want to do - here I am just copying the input stream to the output stream
        copyfileobj(input_blob, output_blob)
        ...
        output_blob.seek(0)  # rewind before uploading

        # Create a new blob
        block_blob_service.create_blob_from_stream('mycontainer', 'myoutputfilename', output_blob)
        # Or update the same blob
        block_blob_service.create_blob_from_stream('mycontainer', 'myinputfilename', output_blob)

Transferring data from gcs to s3 with google-cloud-storage

I'm making a small app to export data from BigQuery to google-cloud-storage and then copy it into aws s3, but I'm having trouble finding out how to do it in python.
I have already written the code in kotlin (because it was easiest for me; for reasons outside the scope of my question, we want it to run in python). In kotlin, the google sdk lets me get an InputStream from the Blob object, which I can then pass to the amazon s3 sdk's AmazonS3.putObject(String bucketName, String key, InputStream input, ObjectMetadata metadata).
With the python sdk it seems I only have the options of downloading to a local file or as a string.
I would like (as I do in kotlin) to pass some object returned from the Blob object into the AmazonS3.putObject() method, without having to save the content as a file first.
I am in no way a python pro, so I might have missed an obvious way of doing this.
I ended up with the following solution: apparently download_to_file writes the data into a file-like object that the boto3 s3 client can handle.
This works just fine for smaller files, but since it buffers everything in memory, it could be problematic for larger files.
from io import BytesIO
import boto3
from google.cloud import storage

def copy_data_from_gcs_to_s3(gcs_bucket, gcs_filename, s3_bucket, s3_filename):
    gcs_client = storage.Client(project="my-project")
    bucket = gcs_client.get_bucket(gcs_bucket)
    blob = bucket.blob(gcs_filename)

    # Buffer the whole object in memory
    data = BytesIO()
    blob.download_to_file(data)
    data.seek(0)

    # Upload the buffer to s3
    s3 = boto3.client("s3")
    s3.upload_fileobj(data, s3_bucket, s3_filename)
If anyone has information about something other than BytesIO to handle the data (e.g. so I can stream the data directly into s3 without having to buffer it all in memory on the host machine), it would be very much appreciated.
google-resumable-media can be used to download the file in chunks from GCS, and smart_open to upload them to S3. That way you don't need to download the whole file into memory. There is also a similar question that addresses this issue: Can you upload to S3 using a stream rather than a local file?
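One way this could look in practice is to let smart_open handle both sides, since it can open gs:// and s3:// URIs as file-like objects and copy between them in chunks. A minimal sketch, assuming smart_open is installed with its GCS and S3 extras and the bucket/key names are placeholders:

from shutil import copyfileobj
from smart_open import open as smart_open_file

def stream_gcs_to_s3(gcs_uri, s3_uri):
    # Both objects are opened as streaming file-like objects,
    # so the data is copied chunk by chunk instead of buffered whole.
    with smart_open_file(gcs_uri, 'rb') as src, smart_open_file(s3_uri, 'wb') as dst:
        copyfileobj(src, dst)

stream_gcs_to_s3('gs://my-gcs-bucket/my-file.csv', 's3://my-s3-bucket/my-file.csv')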
