I have written a load job in Python using Google Colab for development purposes, and every time I run the code there it loads the index into the BigQuery table. However, when I run the same code on Cloud Functions, it does not load the index column.
The index is the default index column that pandas creates.
My code is as follows:
import pandas as pd
from google.cloud import bigquery
import time
from google.cloud import storage
import re
import os
from datetime import datetime, date, timezone
from datetime import date
from dateutil import tz
import numpy as np
job_config = bigquery.LoadJobConfig(
    schema=[bigquery.SchemaField("fecha", bigquery.enums.SqlTypeNames.DATE)],
    write_disposition="WRITE_TRUNCATE",
    create_disposition="CREATE_IF_NEEDED",
    time_partitioning=bigquery.table.TimePartitioning(field="fecha"),
    # schema_update_options='ALLOW_FIELD_ADDITION'
)
client = bigquery.Client()
job = client.load_table_from_dataframe(df, table_id, job_config=job_config)
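A hedged workaround sketch (an assumption about the intent, not part of the original job): making the index an explicit column before loading removes the dependence on how the DataFrame is serialised, so Colab and Cloud Functions should then behave the same:
df = df.reset_index()   # the former index becomes an ordinary column (named "index" by default)
job = client.load_table_from_dataframe(df, table_id, job_config=job_config)
job.result()            # wait for the load job to complete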
My requirements.txt in Cloud Functions includes the following libraries:
pandas
fsspec
gcsfs
google-cloud-bigquery
pyarrow
google-cloud-storage
openpyxl
I am trying to read a CSV from an S3 bucket using my Jupyter notebook. I have read this CSV before without any issues, but now I am receiving an error.
Here is the code I am running:
import pandas as pd
list = pd.read_csv(r's3://analytics/wordlist.csv')
And the error I am getting is:
An error was encountered:
_register_s3_control_events() takes 2 positional arguments but 6 were given
I thought it might be the S3 bucket permissions, but the bucket is public to my organization, so that shouldn't be the issue.
Any ideas what might be wrong?
You could use boto to import the CSV from S3. Boto is a Python library for AWS.
By the way, this should work:
import boto
import pandas as pd
data = pd.read_csv('s3://bucket....csv')
If you are on Python 3.4+, you need:
import boto3
import io
import pandas as pd
s3 = boto3.client('s3')
obj = s3.get_object(Bucket='bucket', Key='key')
df = pd.read_csv(io.BytesIO(obj['Body'].read()))
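As a further sketch (assuming pandas >= 1.2 with s3fs installed; the credential values below are placeholders), you can also let pandas read the object directly and pass credentials through storage_options:
import pandas as pd

# storage_options is forwarded to s3fs; use {"anon": True} for a truly public bucket
df = pd.read_csv(
    "s3://analytics/wordlist.csv",
    storage_options={"key": "YOUR_ACCESS_KEY", "secret": "YOUR_SECRET_KEY"},
)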
Having a little trouble. Trying to insert my yfinance data into Google Sheets. Any help is much appreciated.
!pip install yfinance
import time
import numpy as np
import pandas as pd
from datetime import datetime
import math
from oauth2client.service_account import ServiceAccountCredentials
import gspread
import yfinance as yf
scope = ['https://www.googleapis.com/auth/spreadsheets',
'https://www.googleapis.com/auth/drive.file', 'https://www.googleapis.com/auth/drive']
creds = ServiceAccountCredentials.from_json_keyfile_name('yfinancenegs.json',scope)
client = gspread.authorize(creds)
Structure = client.open('Structure').worksheet('EURUSD')
df = yf.download(tickers='EURUSD=X', period='1d', interval='5m')
df = df.to_json()
Below is where I probably need the most help. It just puts all of the data into one cell.
Structure.update_cell(1,1,df)
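A rough sketch of one way to write the whole frame as a range instead of a single JSON blob in one cell (untested against your sheet, and it skips the df.to_json() step above; the column flattening is only needed if your yfinance version returns MultiIndex columns, and the update argument order differs between gspread versions):
df = yf.download(tickers='EURUSD=X', period='1d', interval='5m')
df.index = df.index.astype(str)   # timestamps must be JSON-serialisable
df = df.reset_index()
# flatten MultiIndex columns if present, otherwise keep the names as-is
df.columns = [' '.join(filter(None, map(str, c))) if isinstance(c, tuple) else str(c)
              for c in df.columns]
values = [df.columns.tolist()] + df.values.tolist()
Structure.update('A1', values)    # one spreadsheet row per DataFrame row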
I am trying to read a gsheet file in Google Drive using Google Colab. I tried using drive.mount to get the file, but I don't know how to get a pandas DataFrame from there. Here is what I tried:
from google.colab import auth
auth.authenticate_user()
import gspread
from oauth2client.client import GoogleCredentials
import os
import pandas as pd
from google.colab import drive
# setup
gc = gspread.authorize(GoogleCredentials.get_application_default())
drive.mount('/content/drive',force_remount=True)
# read data and put it in a dataframe
gsheets = gc.open_by_url('/content/drive/MyDrive/test/myGoogleSheet.gsheet')
As you can tell, I am quite lost with these libraries. I want to use the drive library to access the Drive, get the content with gspread, and read it with pandas.
Can anyone help me find a solution, please?
I found a solution to my problem by looking further into the gspread library. I was able to load the gsheet file by id or by URL, which I did not know was possible. Then I managed to get the content of a sheet and read it into a pandas DataFrame. Here is the code:
from google.colab import auth
auth.authenticate_user()
import gspread
import pandas as pd
from oauth2client.client import GoogleCredentials
# setup
gc = gspread.authorize(GoogleCredentials.get_application_default())
# read data and put it in a dataframe
# spreadsheet = gc.open_by_url('https://docs.google.com/spreadsheets/d/google_sheet_id/edit#gid=0')
spreadsheet = gc.open_by_key('google_sheet_id')
wks = spreadsheet.worksheet('sheet_name')
data = wks.get_all_values()
headers = data.pop(0)
df = pd.DataFrame(data, columns=headers)
print(df)
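As a small follow-up, gspread can also hand back the rows as dictionaries keyed by the header row, which shortens the DataFrame construction (same spreadsheet and worksheet as above):
records = wks.get_all_records()   # list of dicts, first row used as keys
df = pd.DataFrame(records)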
So, my data is stored as CSV files in an OSS bucket on Alibaba Cloud.
I am currently executing a Python script, wherein:
I download the file to my local machine.
I make the changes using a Python script on my local machine.
I store the result in AWS Cloud.
I have to modify this method and schedule a cron job in Alibaba Cloud to automate the running of this script.
The Python script will be uploaded into Task Management of Alibaba Cloud.
So the new steps will be:
Read a file from the OSS bucket into pandas.
Modify it (merge it with other data, change some columns) - this will be done in pandas.
Store the modified file in AWS RDS.
I am stuck at the first step itself.
Error Log:
"No module found" for OSS2 & pandas.
What is the correct way of doing it?
This is a rough draft of my script (this is how I was able to execute the script on my local machine):
import os, re
import oss2            # throws an error: No module found
import datetime as dt
import pandas as pd    # throws an error: No module found
import tarfile
import mysql.connector
from datetime import datetime
from itertools import islice

dates = (dt.datetime.now() + dt.timedelta(days=-1)).strftime("%Y%m%d")

def download_file(access_key_id, access_key_secret, endpoint, bucket):
    # Authentication
    auth = oss2.Auth(access_key_id, access_key_secret)
    # Bucket name
    bucket = oss2.Bucket(auth, endpoint, bucket)
    # Download the file
    try:
        # List all objects in the fun folder and its subfolders.
        for obj in oss2.ObjectIterator(bucket, prefix=dates + 'order'):
            order_file = obj.key
            objectName = order_file.split('/')[1]
            df = pd.read_csv(bucket.get_object(order_file))  # read into pandas
            # FUNCTION to modify and upload
            print("File downloaded")
    except:
        print("Pls check!!! File not read")
    return objectName
import os, re
import oss2
import datetime as dt
import pandas as pd
import tarfile
import mysql.connector
from datetime import datetime
from itertools import islice
import io  # include this new library

dates = (dt.datetime.now() + dt.timedelta(days=-1)).strftime("%Y%m%d")

def download_file(access_key_id, access_key_secret, endpoint, bucket):
    # Authentication
    auth = oss2.Auth(access_key_id, access_key_secret)
    # Bucket name
    bucket = oss2.Bucket(auth, endpoint, bucket)
    # Download the file
    try:
        # List all objects in the fun folder and its subfolders.
        for obj in oss2.ObjectIterator(bucket, prefix=dates + 'order'):
            order_file = obj.key
            objectName = order_file.split('/')[1]
            bucket_object = bucket.get_object(order_file).read()  # read the file from OSS
            img_buf = io.BytesIO(bucket_object)
            df = pd.read_csv(img_buf)  # read into pandas
            # FUNCTION to modify and upload
            print("File downloaded")
    except:
        print("Pls check!!! File not read")
    return objectName
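For the remaining steps (modifying the DataFrame and storing it in AWS RDS), here is a rough sketch under the assumption that df is the modified DataFrame from the function above, that the RDS instance runs MySQL, and that SQLAlchemy is available in the task environment; the connection URL and table name are placeholders:
from sqlalchemy import create_engine

# placeholder connection URL for a MySQL-backed RDS instance
engine = create_engine("mysql+mysqlconnector://user:password@rds-endpoint:3306/dbname")
df.to_sql("orders", engine, if_exists="append", index=False)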
I would like to ask whether it is possible in Orange to load data into it directly, e.g. from BigQuery. I added a "Python Script" block to the flow and my script looks like this:
import os
import sys
import Orange
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = r'path to application_default_credentials.json'
from google.cloud import storage
from google.cloud import bigquery
from google.cloud import secretmanager
bigquery_client = bigquery.Client(project='my project')
secret_client = secretmanager.SecretManagerServiceClient()
query ="""
my query
"""
query_job = bigquery_client.query(query)
out_data = query_job
I also tried wrapping query_job in Orange.data.Table, but it does not work. How can I load data directly from Python into Orange?
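A hedged sketch of the conversion step (assuming Orange 3, which ships a pandas bridge, and the bigquery_client created above): materialise the query result as a pandas DataFrame and turn it into an Orange Table for the block's out_data output:
from Orange.data.pandas_compat import table_from_frame

df = query_job.result().to_dataframe()  # requires pandas (and pyarrow or db-dtypes) installed
out_data = table_from_frame(df)         # Orange Table passed to the block's output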