I need to place a .csv file somewhere and then update it on a daily basis so my code can read it. The person who will be updating the file will not be coding, so for them it should be as simple as uploading a file through the web. I have read dozens of questions here about reading or downloading a file from Google Drive or Google Cloud Storage, but all of them assume updating with code or downloading through an API. I want a slightly simpler solution. For example,
I'm using the code below to read a .csv file from Google Drive (this file is just an example; the actual one will be different).
This is the file that will be updated each day; however, each time I update the file (remove it and upload a new one to Google Drive), the link changes.
Is there a way to update the file every time without changing the code?
For example, could the code fetch the file from a particular folder on Google Drive?
The main constraint is that I need to do this without using the API and Google OAuth.
If that's not possible, where else could the file be uploaded for this purpose? I need to be able to upload the file every day without any code, so that the code reads the updated data. Is there storage like this?
import pandas as pd
import requests
from io import StringIO

# Shared Google Drive link to the .csv file
url = 'https://drive.google.com/file/d/1976F_8WzIxj9wJXjNyN_uD8Lrl_XtpIf/view?usp=sharing'
# Extract the file id from the sharing link
file_id = url.split('/')[-2]
# Build a direct-download URL and fetch the raw CSV text
dwn_url = 'https://drive.google.com/uc?export=download&id=' + file_id
csv_text = requests.get(dwn_url).text
df = pd.read_csv(StringIO(csv_text))
print(df.head())
Create vs update
The first thing you need to be sure of is that the first time you run your code you use the files.create method.
When you update the file afterwards, you should use files.update, which does not create a new file each time; thereby your file ID remains the same.
Google API Python client
IMO you should consider using the Python client library; it will make things a little easier for you:
updated_file = service.files().update(
    fileId=file_id,
    body=file,
    newRevision=new_revision,
    media_body=media_body).execute()
Google Sheets API
You are editing a CSV file. By using the Google Drive API you are downloading the file and uploading it over and over.
Have you considered converting it to a Google Sheet and using the Google Sheets API to edit the file programmatically? It may save you some processing.
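As a rough illustration of that idea, here is a hedged sketch that writes rows into an existing sheet with the Sheets API values().update call; the spreadsheet ID, range, and `sheets_service` object (built the same way as the Drive `service`, but for the Sheets API) are assumptions:

# Rows to write; in practice these would come from your daily data
values = [['date', 'value'], ['2023-01-01', 42]]

sheets_service.spreadsheets().values().update(
    spreadsheetId='YOUR_SPREADSHEET_ID',   # placeholder
    range='Sheet1!A1',
    valueInputOption='RAW',
    body={'values': values}).execute()

A side benefit relevant to the original question: a Google Sheet keeps the same spreadsheet ID no matter how often its contents change, so the reading code never needs a new link.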
Related
I'm using the Python office365 library to access SharePoint documents. I don't know how to access, via the API, a file that has been shared with me through a sharing link. I need to get this file's content and, if possible, its metadata (last modified date). Could anyone help?
The user I'm authenticating as has no access to this SharePoint folder other than a sharing link to a single file.
I tried many variations of the normal file-access API, both by hand and through the office365 library. I couldn't find a way to access a file when all I have is a sharing link to it.
My sharing link looks like this:
https://[redacted].sharepoint.com/:x:/s/[redacted]/dir1/dir2/ESd0HkNNSbJMhQFavQsr9-4BNHC2rHSWsnbs3zRdjtZsC3g so there is not really a filename here, and I cannot read the content of any folder via the API because I get the error Attempted to perform an unauthorized operation. Authentication goes fine (when I mistype the password I get a different error).
According to my research and testing, you can use the following REST API to read a file (get its content):
https://xxxx.sharepoint.com/sites/xxx/_api/web/GetFolderByServerRelativeUrl('/sites/xxx/Library_Name/Folder Name')/Files('Document.docx')/$value
If you want the last modified date, you can use the following REST API to get the Modified field:
https://xxxx.sharepoint.com/sites/xxx/_api/web/lists/getbytitle('test_library')/Items?$select=Modified
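For completeness, here is a rough sketch of doing the same through the Office365-REST-Python-Client library the question mentions; the site URL, server-relative path, and credentials are placeholders, and this assumes the account actually has permission on the file:

from office365.sharepoint.client_context import ClientContext
from office365.runtime.auth.user_credential import UserCredential
from office365.sharepoint.files.file import File

site_url = 'https://xxxx.sharepoint.com/sites/xxx'                    # placeholder
file_url = '/sites/xxx/Library_Name/Folder Name/Document.docx'        # placeholder server-relative path

ctx = ClientContext(site_url).with_credentials(UserCredential('user@xxxx.com', 'password'))

# Download the raw file content
response = File.open_binary(ctx, file_url)
with open('Document.docx', 'wb') as local_file:
    local_file.write(response.content)

# Read metadata such as the last modified date
sp_file = ctx.web.get_file_by_server_relative_url(file_url)
ctx.load(sp_file)
ctx.execute_query()
print(sp_file.properties.get('TimeLastModified'))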
I wish to download a file (https://covid.ourworldindata.org/data/owid-covid-data.csv) from the internet using Python and Google Cloud.
Currently, I have this code.
import os
import wget
from google.cloud import storage

url = os.environ['URL']
bucket_name = os.environ['BUCKET']  # without gs://
file_name = os.environ['FILE_NAME']
cf_path = '/tmp/{}'.format(file_name)

def import_file(event, context):
    # set storage client
    client = storage.Client()
    # get bucket
    bucket = client.get_bucket(bucket_name)
    # download the file to the Cloud Function's tmp directory
    wget.download(url, cf_path)
    # set Blob
    blob = storage.Blob(file_name, bucket)
    # upload the file to GCS
    blob.upload_from_filename(cf_path)
    print("""This Function was triggered by messageId {} published at {}""".format(context.event_id, context.timestamp))
While this code works beautifully, the Covid-19 data is updated daily with a new date added (so if I access the file on 3/7, it includes data up to 3/6). Instead of re-writing the whole file again, I wish to extract only the newly added rows into Google Cloud Storage every time the scheduled function runs, as opposed to overwriting the file that was already saved.
I am fairly weak in programming and would appreciate the help.
While the file is in CSV format, there is also a JSON link (https://covid.ourworldindata.org/data/owid-covid-data.json) if that would make it easier to code.
I can figure out the part about storing it in Cloud Storage, but I need help with the code to extract only the most recently updated rows.
The usual best practice is to load the data into BigQuery every day and to partition by ingestion date.
Then, you can run a query (or create a view) that selects only the most recent version of each row (use a window function with PARTITION BY to deduplicate).
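A minimal sketch of that pattern, assuming the daily file has already been copied to a GCS bucket as in the function above; the project/dataset/table names are placeholders, and the deduplication key (location, date) is an assumption about the OWID schema:

from google.cloud import bigquery

client = bigquery.Client()
table_id = 'your-project.covid.owid_raw'   # placeholder dataset.table

# Append today's file into an ingestion-time partitioned table
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    time_partitioning=bigquery.TimePartitioning(),  # partitions on _PARTITIONTIME
)
client.load_table_from_uri('gs://your-bucket/owid-covid-data.csv',
                           table_id, job_config=job_config).result()

# Query (or view) that keeps only the latest copy of each (location, date) row
dedup_sql = """
SELECT * EXCEPT(rn) FROM (
  SELECT *, ROW_NUMBER() OVER (
    PARTITION BY location, date ORDER BY _PARTITIONTIME DESC) AS rn
  FROM `your-project.covid.owid_raw`)
WHERE rn = 1
"""
latest_rows = client.query(dedup_sql).result()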
I am working on a machine learning task and have saved a Keras model that I want to deploy to GitHub (so that I can host a web demo using Streamlit and/or Flask). However, the model file is so large (> 1 GB) that I cannot upload it to GitHub for free.
My thought process regarding an alternative is to upload it to a cloud service such as Google Drive (or Dropbox, Box, etc.) and then use some sort of Python module to access it from there.
So my question is, can I upload a pickle file containing a pickled Keras model to Google Drive and then access that object from a Python script? If so, how would I go about doing so?
Thank you!
I believe you can. You'll need to pip install oauth2client and gspread. To access the data you would need to enable the API manager on your Google Drive and get credentials in the form of a JSON file. Then you would need to share the file with the email address in the credentials, giving it permission. You could then port over the information as you need to. I'm not sure how Keras works, but this would be the first step.
Another important factor is that the Google API is very touchy when it comes to requests that come in too fast. To overcome this, put sleep commands between each one, but if you do that this method may become way too slow for your idea.
scope = ["https://spreadsheets.google.com/feeds", 'https://www.googleapis.com/auth/spreadsheets',
"https://www.googleapis.com/auth/drive.file", "https://www.googleapis.com/auth/drive"]
creds = ServiceAccountCredentials.from_json_keyfile_name("Your json file here.json", scope)
client = gspread.authorize(creds)
sheet = client.open("your google sheets name or whatever").sheet1 # Open the spreadhseet
data = sheet.get_all_records() # you can call all the information with this.
I understand that you require a way to upload and download large files* from Drive using Python. If I understood your situation correctly, you can achieve your goals easily by using the Drive API, as #TimothyChen commented. First, I highly recommend you follow the Drive API Python Quickstart tutorial to create a working example. Later, you can modify it to use Files.create() and Files.get() to upload/download files as needed. Don't hesitate to ask me more questions if you have doubts.
*Please keep in mind that there is a 5 TB size limit in Drive.
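A hedged sketch of what that could look like once the quickstart's authenticated `service` object exists; the file names are placeholders, and a resumable upload is used because the model is over 1 GB:

import io
from googleapiclient.http import MediaFileUpload, MediaIoBaseDownload

# Upload the pickled model (resumable upload, suitable for large files)
media = MediaFileUpload('model.pkl', mimetype='application/octet-stream', resumable=True)
created = service.files().create(body={'name': 'model.pkl'},
                                 media_body=media, fields='id').execute()
file_id = created['id']

# Later, download it back before unpickling
request = service.files().get_media(fileId=file_id)
with io.FileIO('model.pkl', 'wb') as fh:
    downloader = MediaIoBaseDownload(fh, request)
    done = False
    while not done:
        status, done = downloader.next_chunk()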
I am using Google Colab and I would like to use my custom libraries / scripts that I have stored on my local machine. My current approach is the following:
# (Question 1)
from google.colab import drive
drive.mount("/content/gdrive")
# Annoying chain of granting access to Google Colab
# and entering the OAuth token.
And then I use:
# (Question 2)
!cp /content/gdrive/My\ Drive/awesome-project/*.py .
Question 1:
Is there a way to avoid mounting the drive entirely? Whenever the execution context changes (e.g. when I select "Hardware Acceleration = GPU", or when I wait an hour), I have to re-generate and re-enter the OAuth token.
Question 2:
Is there a way to sync files between my local machine and my Google Colab scripts more elegantly?
A partial (not very satisfying) answer regarding Question 1: I saw that one could install and use Dropbox. Then you can hardcode the API key into the application and mounting is done, regardless of whether or not it is a new execution context. I wonder if a similar approach exists based on Google Drive as well.
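For what it's worth, a minimal sketch of that Dropbox workaround in a Colab cell; the access token and paths are placeholders, and hardcoding a token in a shared notebook has obvious security trade-offs:

!pip install dropbox
import dropbox

# Placeholder access token generated in the Dropbox app console
dbx = dropbox.Dropbox('YOUR_ACCESS_TOKEN')
# Copy a script from Dropbox into the Colab runtime
dbx.files_download_to_file('awesome_module.py', '/awesome-project/awesome_module.py')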
Question 1.
Great question, and yes there is. I have been using this workaround, which is particularly useful if you are a researcher and want others to be able to re-run your code, or just 'colab'orate when working with larger datasets. The method below has worked well for working as a team, since there are challenges when each person keeps their own version of the datasets.
I have used this regularly on 30+ GB of image files downloaded and unzipped to the Colab runtime.
The file id is in the link provided when you share from Google Drive.
You can also select multiple files, share them all, and generate, for example, a .txt or .json file which you can parse to extract the file ids.
from google_drive_downloader import GoogleDriveDownloader as gdd

# some file id / list of file ids parsed from file urls
google_fid_id = '1-4PbytN2awBviPS4Brrb4puhzFb555g2'
destination = 'dir/dir/fid'
# if it is a zip file, add the kwarg unzip=True
gdd.download_file_from_google_drive(file_id=google_fid_id,
                                    dest_path=destination, unzip=True)
A url parsing function to get file ids from a list of urls might look like this:
def parse_urls():
    with open('/dir/dir/files_urls.txt', 'r') as fb:
        txt = fb.readlines()
    return [url.split('/')[-2] for url in txt[0].split(',')]
One health warning is that you can only repeat this a small number of times in a 24 hour window for the same files.
Here's the gdd git repo:
https://github.com/ndrplz/google-drive-downloader
Here is a working example (my own) of how it works inside a bigger script:
https://github.com/fdsig/image_utils
Question 2.
You can connect to a local runtime, but this also means using local resources (GPU/CPU etc.).
Really hope this helps :-).
F~
If your code isn't secret, you can use git to sync your local code to GitHub. Then, git clone it into Colab with no need for any authentication, as in the sketch below.
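A quick sketch of that flow inside a Colab cell; the repository name is a placeholder reusing the awesome-project name from the question:

!git clone https://github.com/your-username/awesome-project.git

import sys
sys.path.append('/content/awesome-project')  # make the cloned modules importable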
Is there any way to take a small data JSON object from memory and upload it to Google BigQuery without using the file system?
We have working code that uploads files to BigQuery. This code no longer works for us because our new script runs on Google App Engine which doesn't have write access to the file system.
We notice that our BigQuery upload configuration has a "MediaFileUpload" option with a data_path. We have been unable to find if there is some way to change the data_path to something in memory - or if there is another method apart from MediaFileUpload that could upload data from memory to BigQuery.
    media_body=MediaFileUpload(
        data_path,
        mimetype='application/octet-stream'))
job = insert_request.execute()
We have seen solutions which recommend uploading data to Google Cloud Storage then referencing that file to upload to BigQuery for when datasets are large. Our dataset is small so this step seems like an unnecessary use of bandwidth, processing and time.
As a result, we are wondering if there is a way to take our small JSON object and upload it directly to BigQuery using Python?
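For what it's worth, google-api-python-client also ships MediaIoBaseUpload, an in-memory counterpart of MediaFileUpload that wraps a file-like object instead of a path; here is a minimal sketch (the sample JSON object is a placeholder, and it assumes the existing load job is configured for newline-delimited JSON):

import io
import json
from googleapiclient.http import MediaIoBaseUpload

small_json_object = {'name': 'example', 'value': 1}   # placeholder data

# Newline-delimited JSON encoded in memory instead of written to disk
buf = io.BytesIO(json.dumps(small_json_object).encode('utf-8') + b'\n')

# Drop-in replacement for the MediaFileUpload(...) argument shown above
media_body = MediaIoBaseUpload(buf, mimetype='application/octet-stream')

Because nothing is written to disk, this sidesteps App Engine's read-only file system.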