Check if a file in SharePoint has finished uploading - python

I am building a pipeline in which SharePoint will be the source for data ingestion. I want to use Azure Logic Apps with a trigger that runs when a file is created or modified. When a file is uploaded to SharePoint, Logic Apps should copy the file to Blob Storage. The problem I am facing is that the trigger can fire before the file is 100% uploaded, which leads to copying empty or incomplete files.
I tried several SharePoint triggers to see if the problem was specific to one of them, but they all have the same issue.
I decided to use Python with Office365-REST-Python-Client deployed in Azure Functions to handle copying the files to Azure Blob Storage. I have the following code:
from office365.sharepoint.files.file import File

def download_file(context, sharepoint_file_path, local_file_path):
    # Fetch the file contents from SharePoint as a single binary response.
    response = File.open_binary(context, sharepoint_file_path)
    response.raise_for_status()
    # Write the bytes to a local file before copying them to Blob Storage.
    with open(local_file_path, 'wb') as f:
        f.write(response.content)
I checked the response's status_code, and even for incomplete files it returns 200, which still does not help with checking whether the file is complete.
How can I solve this? Thank you.

In my opinion, it is difficult to check whether a file in SharePoint has finished uploading, but we can avoid copying a file that has not finished uploading with a workaround.
Estimate how long the upload will take based on the size of your files, and then add a "Delay" action after the trigger.
For example, with a delay of 5 minutes, the file will have completed its upload by the time the action after the delay runs, and it will be copied to Blob Storage successfully.
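If you would rather not rely on a fixed delay, one Python-side workaround is to poll the file until its size stops changing and only then write it out. The sketch below reuses File.open_binary from the question; the poll interval, the retry count, and the assumption that a partially uploaded file shows up with a smaller size are my own assumptions you would need to validate against your tenant.

import time
from office365.sharepoint.files.file import File

def download_when_stable(context, sharepoint_file_path, local_file_path,
                         poll_seconds=30, max_polls=10):
    # Treat two consecutive polls with the same non-zero size as "upload done".
    previous_size = -1
    for _ in range(max_polls):
        response = File.open_binary(context, sharepoint_file_path)
        response.raise_for_status()
        current_size = len(response.content)
        if current_size == previous_size and current_size > 0:
            with open(local_file_path, 'wb') as f:
                f.write(response.content)
            return True
        previous_size = current_size
        time.sleep(poll_seconds)
    return False  # the size never stabilised within max_polls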

Related

How to upload large files on Streamlit using st.file_uploader

I am trying to upload a binary file of size 10 GB using st.file_uploader; however, I get the following error message. In fact, I get the same error pretty much whenever I try to upload any file above 2 GB, i.e. I did not get this error when I uploaded a 1.2 GB file.
I have set my file capacity to 10 GB in the config file.
As a matter of fact, the file loading shows as it should, then the file appears to be uploaded for a very short time (maybe a second); however, the attachment then disappears along with the file_uploader widget itself, and the following message pops up.
By default, uploaded files are limited to 200 MB. You can raise this with the server.maxUploadSize config option as follows.
Set the config option in .streamlit/config.toml:
[server]
maxUploadSize=10000
The bug has been raised in the Streamlit repo and is yet to be fixed: https://github.com/streamlit/streamlit/issues/5938
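For what it's worth, here is a minimal sketch of an app that reads a large upload in chunks once the limit above has been raised; the 8 MB chunk size and the app layout are my own assumptions, not part of the original answer.

# app.py -- run with `streamlit run app.py`; assumes server.maxUploadSize
# has been raised in .streamlit/config.toml as shown above.
import streamlit as st

uploaded = st.file_uploader("Upload a large binary file")
if uploaded is not None:
    total = 0
    # UploadedFile is file-like, so read it in chunks rather than pulling
    # the whole thing into memory at once.
    while True:
        chunk = uploaded.read(8 * 1024 * 1024)  # 8 MB per read
        if not chunk:
            break
        total += len(chunk)
    st.write(f"Received {total / 1e9:.2f} GB")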

How to read file from Google Drive after it was updated?

I need to place a .csv file somewhere and then update it on a daily basis for my code to read. The person who will be updating the file will not be using code, so it should be as easy as uploading it from the web. I have read dozens of questions here about reading or downloading a file from Google Drive or Google Cloud Storage, but all of them assume updating with code or downloading via an API. I want a somewhat simpler solution. For example:
I'm using the code below to read the .csv file from Google Drive (this file is an example; the actual one will be different).
This is the file that will be updated each day; however, each time I update the file (remove it and upload a new one into Google Drive) the link changes.
Is there a way to update the file every time without changing the code?
For example, could the code fetch the file from a particular folder on Google Drive?
The main thing is that I need to do this without using an API and Google OAuth.
If that is not possible, where could the file be uploaded for this purpose? I need to be able to upload the file every day without any code, so that the code reads the updated data. Is there storage like this?
import pandas as pd
import requests
from io import StringIO

# Shared link to the .csv file on Google Drive.
url = 'https://drive.google.com/file/d/1976F_8WzIxj9wJXjNyN_uD8Lrl_XtpIf/view?usp=sharing'

# Build a direct-download URL from the file ID embedded in the share link.
file_id = url.split('/')[-2]
dwn_url = 'https://drive.google.com/uc?export=download&id=' + file_id

# Download the CSV text and load it into a DataFrame.
url2 = requests.get(dwn_url).text
csv_raw = StringIO(url2)
df = pd.read_csv(csv_raw)
print(df.head())
create vs update
The first thing you need to be sure of is that the first time you run your code you use the file.create method.
However, when you update the file you should be using file.update, which does not create a new file each time; thereby your file ID will remain the same.
google api python client
IMO you should consider using the Python client library; it will make things a little easier for you:
updated_file = service.files().update(
    fileId=file_id,
    body=file,
    newRevision=new_revision,
    media_body=media_body).execute()
Google sheets api
You are editing a csv file. By using the Google Drive API you are downloading the file and uploading it over and over.
Have you considered converting it to a Google Sheet and using the Google Sheets API to edit the file programmatically? It may save you some processing.
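To make the create-vs-update point concrete, here is a rough sketch of updating the existing Drive file in place with the Python client library, so that its file ID, and therefore the share link used in the question's download code, never changes. The authorized `service` object, the FILE_ID value, and the local file name are placeholders I am assuming, not something from the original answer.

from googleapiclient.http import MediaFileUpload

FILE_ID = 'your-existing-file-id'   # the ID returned by the initial create

# `service` is assumed to be an authorized Drive API client.
media = MediaFileUpload('daily_data.csv', mimetype='text/csv')
updated_file = service.files().update(
    fileId=FILE_ID,
    media_body=media,
).execute()

# Because the file ID is unchanged, the pandas snippet in the question keeps
# working without ever editing the URL.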

How to read a part of amazon s3 key, assuming that "multipart upload complete" is yet to happen for that key?

I'm working on AWS S3 multipart upload, and I am facing the following issue.
Basically, I am uploading a file chunk by chunk to S3, and if any write happens to the file locally during that time, I would like to reflect that change in the S3 object that is currently being uploaded.
Here is the procedure that I am following:
Initiate the multipart upload operation.
Upload the parts one by one [5 MB chunk size] [do not complete that operation yet].
If a write goes to that file in the meantime [assuming I have the details of the write: offset, no_bytes_written],
I will calculate the part number for the write that happened locally, and read that chunk from the uploaded S3 object.
Read the same chunk from the local file and write it over the part read from S3.
Upload the same part to the S3 object.
This will be an async operation. I will complete the multipart operation at the end.
I am facing an issue with reading an uploaded part while the multipart upload is still in progress. Is there an API available for this?
Any help would be greatly appreciated.
There is no API in S3 to retrieve a part of a multipart upload. You can list the parts, but I don't believe there is any way to retrieve an individual part once it has been uploaded.
You can re-upload a part. S3 will just throw away the previous part and use the new one in its place. So, if you had the old and new versions of the file locally and were keeping track of the parts yourself, I suppose you could, in theory, replace individual parts that had been modified after the multipart upload was initiated. However, it seems to me that this would be a very complicated and error-prone process. What if the change made to a file was to add several MB of data to it? Wouldn't that change your part boundaries? Would that potentially affect other parts as well?
I'm not saying it can't be done, but I am saying it seems complicated and would require you to do a lot of bookkeeping on the client side.
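To illustrate the "re-upload a part" idea, here is a hedged boto3 sketch. There is no S3 call that reads back an uploaded part, so it works purely from the local file: it assumes you track part numbers and ETags yourself and that the modified byte range still maps onto a single 5 MB part. The bucket, key, and file path are placeholders.

import boto3

s3 = boto3.client('s3')
BUCKET, KEY = 'my-bucket', 'my-object'   # placeholder names
LOCAL_PATH = 'local_file.bin'            # placeholder path
PART_SIZE = 5 * 1024 * 1024              # 5 MB parts, as in the question

# 1. Initiate the multipart upload once, at the start.
upload_id = s3.create_multipart_upload(Bucket=BUCKET, Key=KEY)['UploadId']
parts = {}  # part_number -> ETag, tracked on the client side

def upload_part(part_number):
    # Uploading the same part number again makes S3 discard the earlier
    # version of that part and keep this one.
    with open(LOCAL_PATH, 'rb') as f:
        f.seek((part_number - 1) * PART_SIZE)
        body = f.read(PART_SIZE)
    resp = s3.upload_part(Bucket=BUCKET, Key=KEY, UploadId=upload_id,
                          PartNumber=part_number, Body=body)
    parts[part_number] = resp['ETag']

# If a local write lands at byte `offset`, recompute and redo its part:
#   upload_part(offset // PART_SIZE + 1)

# 2. Complete the upload with the latest ETag recorded for every part.
s3.complete_multipart_upload(
    Bucket=BUCKET, Key=KEY, UploadId=upload_id,
    MultipartUpload={'Parts': [{'PartNumber': n, 'ETag': e}
                               for n, e in sorted(parts.items())]})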

Big file stored in GCS is served only partially when using the blob key

I am writing zip files into Google Cloud Storage using the GCS client library. I then retrieve the blob key using the create_gs_key() function. Immediately after creating the file I try to download it with a second HTTP request, writing the blob key obtained in the previous call into the X-AppEngine-BlobKey response header.
When the file is relatively big, usually about 30 MB or more, the first try sometimes results in an incomplete file, a few MB smaller than the target size. If you wait a little, the next try is usually fine.
I had the same problem when I tried to write files into the Blobstore using the API that is now deprecated.
Is it guaranteed that when you close the file, it is already available for serving?
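One way to act on the "wait a little and the next try is fine" observation is to poll the object's metadata until it reports the size you just wrote before handing out the blob key. This is only a sketch under my own assumptions: it presumes the appengine-gcs-client cloudstorage module is in use and that stat() reflects the final size once the write is fully visible.

import time
import cloudstorage  # appengine-gcs-client, the GCS client library above

def wait_until_visible(gcs_filename, expected_size, attempts=10, delay=2):
    # Poll the object's metadata until the reported size matches what was
    # written; the attempt count and delay are arbitrary assumptions.
    for _ in range(attempts):
        if cloudstorage.stat(gcs_filename).st_size >= expected_size:
            return True
        time.sleep(delay)
    return False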

AppEngine: Out of Blobstore and into S3?

I'm uploading data to the Blobstore. It should stay there only temporarily and then be uploaded from within my App Engine app to Amazon S3.
It seems I can only get the data through a BlobDownloadHandler, as described in the Blobstore API: http://code.google.com/intl/de-DE/appengine/docs/python/blobstore/overview.html#Serving_a_Blob
So I tried to fetch that blob-specific download URL from within my application (remember my question yesterday). Fetching internal URLs from within App Engine (not the development server) works, even though it's bad style, I know.
But getting the blob is not working. My code looks like:
result = urllib2.urlopen('http://my-app-url.appspot.com/get_blob/'+str(qr.blob_key))
And I'm getting
DownloadError: ApplicationError: 2
Raised by:
  File "/base/python_runtime/python_lib/versions/1/google/appengine/api/urlfetch.py", line 332, in _get_fetch_result
    raise DownloadError(str(err))
Even after searching I really don't know what to do.
All the tutorials I've seen focus only on serving the blob back to the user through a URL. But I want to retrieve the blob from the Blobstore and send it to an S3 bucket. Does anyone have an idea how I could do that, or is it not possible at all?
Thanks in advance.
You can use a BlobReader to access that Blob and send that to S3.
http://code.google.com/intl/de-DE/appengine/docs/python/blobstore/blobreaderclass.html
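A rough sketch of what that answer suggests, reading the blob with BlobReader and streaming it to S3 with the boto library. The use of boto, the bucket name, and the credential placeholders are my assumptions; any S3 client that accepts a file-like object would work the same way.

import boto
from google.appengine.ext import blobstore

def copy_blob_to_s3(blob_key, bucket_name, key_name):
    # BlobReader exposes the blob as a file-like object, so there is no need
    # to fetch it over HTTP through your own download handler.
    reader = blobstore.BlobReader(blob_key)

    # Placeholder credentials; load real ones from your app's configuration.
    conn = boto.connect_s3('AWS_ACCESS_KEY_ID', 'AWS_SECRET_ACCESS_KEY')
    bucket = conn.get_bucket(bucket_name)
    key = bucket.new_key(key_name)

    # set_contents_from_file streams from the file-like reader into S3.
    key.set_contents_from_file(reader)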
