Loading data directly from BigQuery to Orange - python

I would like to ask if it is possible to load data into Orange directly, e.g. from BigQuery. I added a "Python Script" widget to the workflow and my script looks like this:
import os
import sys
import Orange
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = r'path to application_default_credentials.json'
from google.cloud import storage
from google.cloud import bigquery
from google.cloud import secretmanager

bigquery_client = bigquery.Client(project='my project')  # use the imported name, not google.cloud.bigquery
secret_client = secretmanager.SecretManagerServiceClient()

query = """
my query
"""
query_job = bigquery_client.query(query)
out_data = query_job
I also tried wrapping query_job in Orange.data.Table, but it does not work. How can I load data directly from Python into Orange?
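One approach that might work (a minimal sketch, assuming a recent Orange 3 release where Orange.data.pandas_compat.table_from_frame is available) is to materialize the query result as a pandas DataFrame first and then convert it:

from Orange.data.pandas_compat import table_from_frame

df = query_job.result().to_dataframe()  # wait for the query and fetch it as a pandas DataFrame
out_data = table_from_frame(df)         # the Python Script widget sends out_data downstream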

Related

Load from pandas to Bigquery on google colab

I have written a load job in Python using Google Colab for development purposes, but every time I run the code it loads the index into the BigQuery table. However, when I run the same code on Cloud Functions, it does not load the index column.
The index is the default index column pandas creates.
My code is as follows:
import pandas as pd
from google.cloud import bigquery
import time
from google.cloud import storage
import re
import os
from datetime import datetime, date, timezone
from datetime import date
from dateutil import tz
import numpy as np
job_config = bigquery.LoadJobConfig(
    schema=[bigquery.SchemaField("fecha", bigquery.enums.SqlTypeNames.DATE)],
    write_disposition="WRITE_TRUNCATE",
    create_disposition="CREATE_IF_NEEDED",
    time_partitioning=bigquery.table.TimePartitioning(field="fecha"),
    # schema_update_options='ALLOW_FIELD_ADDITION'
)
client = bigquery.Client()
# df and table_id are defined elsewhere in the function
job = client.load_table_from_dataframe(df, table_id, job_config=job_config)
My requirements.txt in Cloud Functions includes the following libraries:
pandas
fsspec
gcsfs
google-cloud-bigquery
pyarrow
google-cloud-storage
openpyxl
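If the extra column really is the pandas default index being serialized into the load job, one workaround (a sketch, not tested against your exact environment) is to reset the index before calling load_table_from_dataframe, so the behaviour is the same regardless of the pandas/pyarrow versions installed:

# Drop the default RangeIndex so it is never written into the table
df = df.reset_index(drop=True)
job = client.load_table_from_dataframe(df, table_id, job_config=job_config)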

Read an object from Alibaba OSS and modify it using pandas python

So, my data is in the format of CSV files in the OSS bucket of Alibaba Cloud.
I am currently executing a Python script, wherein:
I download the file into my local machine.
Do the changes using Python script in my local machine.
Store it in AWS Cloud.
I have to modify this method and schedule a cron job in Alibaba Cloud to automate the running of this script.
The Python script will be uploaded into Task Management of Alibaba Cloud.
So the new steps will be:
Read a file from the OSS bucket into Pandas.
Modify it (merging it with other data, some column changes) using pandas.
Store the modified file into AWS RDS.
I am stuck at the first step itself.
Error Log:
"No module found" for OSS2 & pandas.
What is the correct way of doing it?
This is a rough draft of my script (how I was able to execute it on my local machine):
import os, re
import oss2            # throws an error: No module found
import datetime as dt
import pandas as pd    # throws an error: No module found
import tarfile
import mysql.connector
from datetime import datetime
from itertools import islice

dates = (dt.datetime.now() + dt.timedelta(days=-1)).strftime("%Y%m%d")

def download_file(access_key_id, access_key_secret, endpoint, bucket):
    # Authentication
    auth = oss2.Auth(access_key_id, access_key_secret)
    # Bucket name
    bucket = oss2.Bucket(auth, endpoint, bucket)
    # Download the file
    try:
        # List all objects in the fun folder and its subfolders.
        for obj in oss2.ObjectIterator(bucket, prefix=dates + 'order'):
            order_file = obj.key
            objectName = order_file.split('/')[1]
            df = pd.read_csv(bucket.get_object(order_file))  # read into pandas
            # FUNCTION to modify and upload
        print("File downloaded")
    except:
        print("Pls check!!! File not read")
    return objectName
import os, re
import oss2
import datetime as dt
import pandas as pd
import tarfile
import mysql.connector
from datetime import datetime
from itertools import islice
import io  # include this new library

dates = (dt.datetime.now() + dt.timedelta(days=-1)).strftime("%Y%m%d")

def download_file(access_key_id, access_key_secret, endpoint, bucket):
    # Authentication
    auth = oss2.Auth(access_key_id, access_key_secret)
    # Bucket name
    bucket = oss2.Bucket(auth, endpoint, bucket)
    # Download the file
    try:
        # List all objects in the fun folder and its subfolders.
        for obj in oss2.ObjectIterator(bucket, prefix=dates + 'order'):
            order_file = obj.key
            objectName = order_file.split('/')[1]
            bucket_object = bucket.get_object(order_file).read()  # read the file from OSS
            img_buf = io.BytesIO(bucket_object)
            df = pd.read_csv(img_buf)  # read into pandas
            # FUNCTION to modify and upload
        print("File downloaded")
    except:
        print("Pls check!!! File not read")
    return objectName
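For the final step (storing the modified data in AWS RDS), a minimal sketch using the mysql.connector module already imported in your script could look like the following; the connection details and table name are placeholders, and the target table is assumed to exist already:

def upload_to_rds(df, host, user, password, database, table):
    # Placeholder RDS endpoint and credentials; replace with your own
    conn = mysql.connector.connect(host=host, user=user, password=password, database=database)
    cursor = conn.cursor()
    cols = ", ".join(df.columns)
    placeholders = ", ".join(["%s"] * len(df.columns))
    insert_sql = "INSERT INTO {} ({}) VALUES ({})".format(table, cols, placeholders)
    # Insert all rows of the modified DataFrame in one batch
    cursor.executemany(insert_sql, [tuple(row) for row in df.itertuples(index=False)])
    conn.commit()
    cursor.close()
    conn.close()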

Read excel data from Azure blob and convert into csv using Python azure function

I would like to deploy an Azure Function with the following functionality:
Read Excel data from Azure Blob into a stream object instead of downloading it onto a VM.
Read it into a data frame.
I need help reading the Excel file into a data frame. How should I update the placeholder download_file_path to read the Excel data?
import pandas as pd
import os
import io
from azure.storage.blob import BlobClient, BlobServiceClient, ContentSettings

connectionstring = "XXXXXXXXXXXXXXXX"
excelcontainer = "excelcontainer"
excelblobname = "Resource.xlsx"
sheet = "Resource"
blob_service_client = BlobServiceClient.from_connection_string(connectionstring)
download_file_path = os.path.join(excelcontainer)
blob_client = blob_service_client.get_blob_client(container=excelcontainer, blob=excelblobname)
with open(download_file_path, "rb") as f:
    data_bytes = f.read()
df = pd.read_excel(data_bytes, sheet_name=sheet, encoding="utf-16")
If you want to read an Excel file from Azure Blob with pandas, you have two choices:
Generate a SAS token for the blob, then use the blob URL with the SAS token to access it:
from datetime import datetime, timedelta
import pandas as pd
import azure.functions as func
from azure.storage.blob import BlobSasPermissions, generate_blob_sas

def main(req: func.HttpRequest) -> func.HttpResponse:
    account_name = 'andyprivate'
    account_key = 'h4pP1fe76*****A=='
    container_name = 'test'
    blob_name = "sample.xlsx"
    sas = generate_blob_sas(
        account_name=account_name,
        container_name=container_name,
        blob_name=blob_name,
        account_key=account_key,
        permission=BlobSasPermissions(read=True),
        expiry=datetime.utcnow() + timedelta(hours=1)
    )
    blob_url = f'https://{account_name}.blob.core.windows.net/{container_name}/{blob_name}?{sas}'
    df = pd.read_excel(blob_url)
    print(df)
    ......
Download the blob
import pandas as pd
import azure.functions as func
from azure.storage.blob import BlobServiceClient

def main(req: func.HttpRequest) -> func.HttpResponse:
    account_name = 'andyprivate'
    account_key = 'h4pP1f****='
    blob_service_client = BlobServiceClient(account_url=f'https://{account_name}.blob.core.windows.net/', credential=account_key)
    blob_client = blob_service_client.get_blob_client(container='test', blob='sample.xlsx')
    downloader = blob_client.download_blob()
    df = pd.read_excel(downloader.readall())
    print(df)
    ....
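If your pandas version complains about being passed raw bytes, a small variation (a sketch of the same download path) is to wrap the downloaded content in a BytesIO buffer before handing it to pandas:

import io
import pandas as pd

downloader = blob_client.download_blob()
df = pd.read_excel(io.BytesIO(downloader.readall()))  # file-like object works across pandas versions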

CSV to .GZ using Cloud function, Python

I've been trying to compress my CSV files to .gz before uploading to GCS using a Cloud Function (Python 3.7), but my code only adds the .gz extension and doesn't actually compress the file, so the resulting file is corrupted. Can you please show me how to fix this? Thanks.
Here is part of my code:
import gzip
import time
from google.cloud import bigquery
from google.cloud import storage

def to_gcs(request):
    job_config = bigquery.QueryJobConfig()
    gcs_filename = 'filename_{}.csv'
    bucket_name = 'bucket_gcs_name'
    subfolder = 'subfolder_name'
    client = bigquery.Client()
    job_config.write_disposition = bigquery.WriteDisposition.WRITE_TRUNCATE
    QUERY = "SELECT * FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*` session, UNNEST(hits) AS hits"
    query_job = client.query(QUERY, location='US', job_config=job_config)
    while not query_job.done():
        time.sleep(1)
    rows_df = query_job.result().to_dataframe()
    storage_client = storage.Client()
    storage_client.get_bucket(bucket_name).blob(subfolder + '/' + gcs_filename + '.gz').upload_from_string(
        rows_df.to_csv(sep='|', index=False, encoding='utf-8', compression='gzip'),
        content_type='application/octet-stream')
As suggested in the thread referred to by @Sam Mason in a comment, once you have obtained the pandas dataframe, you should use a TextIOWrapper() and BytesIO() as described in the following sample.
The following sample was inspired by @ramhiser's answer in this SO thread:
from io import BytesIO, TextIOWrapper
import gzip

df = query_job.result().to_dataframe()
blob = bucket.blob(f'{subfolder}/{gcs_filename}.gz')

with BytesIO() as gz_buffer:
    with gzip.GzipFile(mode='w', fileobj=gz_buffer) as gz_file:
        df.to_csv(TextIOWrapper(gz_file, 'utf8'), index=False)
    gz_buffer.seek(0)  # rewind the buffer before uploading, otherwise an empty object is written
    blob.upload_from_file(gz_buffer, content_type='application/octet-stream')
Also note that if you expect this file to ever get larger than a couple of MB, you are probably better off using something from the tempfile module in place of BytesIO. SpooledTemporaryFile is basically designed for this use case: it uses an in-memory buffer up to a given size and only spills to disk if the file gets really big.
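As a sketch of that variant (the 10 MB spill threshold is an arbitrary choice, adjust it to your memory limits):

import gzip
from io import TextIOWrapper
from tempfile import SpooledTemporaryFile

with SpooledTemporaryFile(max_size=10 * 1024 * 1024) as gz_buffer:  # spill to disk above ~10 MB
    with gzip.GzipFile(mode='w', fileobj=gz_buffer) as gz_file:
        df.to_csv(TextIOWrapper(gz_file, 'utf8'), index=False)
    gz_buffer.seek(0)
    blob.upload_from_file(gz_buffer, content_type='application/octet-stream')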
Hi, I tried to reproduce your use case.
I created a Cloud Function using this quickstart link:
def hello_world(request):
    from google.cloud import bigquery
    from google.cloud import storage
    import pandas as pd

    client = bigquery.Client()
    storage_client = storage.Client()
    path = '/tmp/file.gz'
    query_job = client.query("""
        SELECT
          CONCAT(
            'https://stackoverflow.com/questions/',
            CAST(id as STRING)) as url,
          view_count
        FROM `bigquery-public-data.stackoverflow.posts_questions`
        WHERE tags like '%google-bigquery%'
        ORDER BY view_count DESC
        LIMIT 10""")
    results = query_job.result().to_dataframe()
    results.to_csv(path, sep='|', index=False, encoding='utf-8', compression='gzip')
    bucket = storage_client.get_bucket('mybucket')
    blob = bucket.blob('file.gz')
    blob.upload_from_filename(path)
This is the requirements.txt:
# Function dependencies, for example:
google-cloud-bigquery
google-cloud-storage
pandas
I deployed the function.
I checked the output.
gsutil cp gs://mybucket/file.gz file.gz
gzip -d file.gz
cat file
#url|view_count
https://stackoverflow.com/questions/22879669|52306
https://stackoverflow.com/questions/13530967|46073
https://stackoverflow.com/questions/35159967|45991
https://stackoverflow.com/questions/10604135|45238
https://stackoverflow.com/questions/16609219|37758
https://stackoverflow.com/questions/11647201|32963
https://stackoverflow.com/questions/13221978|32507
https://stackoverflow.com/questions/27060396|31630
https://stackoverflow.com/questions/6607552|31487
https://stackoverflow.com/questions/11057219|29069

Create Avro file directly to Google Cloud Storage

I'd like to skip the step of creating an Avro file locally and instead upload it directly to Google Cloud Storage.
I checked the blob.upload_from_string option, but honestly I don't know what it should replace in my code, and I don't know if it's the best option for what I need. With that, I could build a more modern pipeline by including the script inside a Docker image.
Can this be done somehow, based on the script below?
import csv
import base64
import json
import io
import avro.schema
import avro.io
from avro.datafile import DataFileReader, DataFileWriter
import math
import os
import gcloud
from gcloud import storage
from google.cloud import bigquery
from oauth2client.client import GoogleCredentials
from datetime import datetime, timedelta
import numpy as np
try:
    script_path = os.path.dirname(os.path.abspath(__file__)) + "/"
except:
    script_path = "C:\\Users\\me\\Documents\\Keys\\key.json"
#Bigquery Credentials and settings
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = script_path
folder = str((datetime.now() - timedelta(days=1)).strftime('%Y-%m-%d'))
bucket_name = 'gs://new_bucket/table/*.csv'
dataset = 'dataset'
tabela = 'table'
schema = avro.schema.Parse(open("C:\\Users\\me\\schema_table.avsc", "rb").read())
writer = DataFileWriter(open("C:\\Users\\me\\table_register.avro", "wb"), avro.io.DatumWriter(), schema)
def insert_bigquery(target_uri, dataset_id, table_id):
    bigquery_client = bigquery.Client()
    dataset_ref = bigquery_client.dataset(dataset_id)
    job_config = bigquery.LoadJobConfig()
    job_config.schema = [
        bigquery.SchemaField('id', 'STRING', mode='REQUIRED')
    ]
    job_config.source_format = bigquery.SourceFormat.CSV
    job_config.field_delimiter = ";"
    uri = target_uri
    load_job = bigquery_client.load_table_from_uri(
        uri,
        dataset_ref.table(table_id),
        job_config=job_config
    )
    print('Starting job {}'.format(load_job.job_id))
    load_job.result()
    print('Job finished.')
#insert_bigquery(bucket_name, dataset, tabela)
def get_data_from_bigquery():
    """query bigquery to get data to import to PSQL"""
    bq = bigquery.Client()
    # Fetch IDs
    query = """SELECT id FROM dataset.base64_data"""
    query_job = bq.query(query)
    data = query_job.result()
    rows = list(data)
    return rows
a = get_data_from_bigquery()
length = len(a)
line_count = 0
for row in range(length):
    bytes = base64.b64decode(str(a[row][0]))
    bytes = bytes[5:]
    buf = io.BytesIO(bytes)
    decoder = avro.io.BinaryDecoder(buf)
    rec_reader = avro.io.DatumReader(avro.schema.Parse(open("C:\\Users\\me\\schema_table.avsc").read()))
    out = rec_reader.read(decoder)
    writer.append(out)
writer.close()
def upload_blob(bucket_name, source_file_name, destination_blob_name):
    storage_client = storage.Client()
    bucket = storage_client.get_bucket(bucket_name)
    blob = bucket.blob("insert_transfer/" + destination_blob_name)
    blob.upload_from_filename(source_file_name)
    print('File {} uploaded to {}'.format(
        source_file_name,
        destination_blob_name
    ))
upload_blob('new_bucket', 'C:\\Users\\me\\table_register.avro', 'table_register.avro')
I have seen your script and I can see that you are getting data from BigQuery. I can confirm that I reproduced your scenario and was able to export data from BigQuery to Google Cloud Storage directly, without creating the Avro file locally.
I suggest you take a look here, where it describes how to export table data from BigQuery to Google Cloud Storage. Here are the steps to follow:
Open the BigQuery web UI in your Cloud Console.
In the navigation panel, in the Resources section, expand your project and click your dataset to expand it. Find and click the table that contains the data you're exporting.
On the right side of the window, click Export, then select Export to Cloud Storage.
In the Export to Cloud Storage dialog:
For Select Cloud Storage location, browse for the bucket.
For Export format, choose the format for your exported data; in your specific case, choose “Avro”.
Click Export.
Nonetheless, there's also the possibility of doing it with Python. I recommend you take a look here.
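In case it helps, here is a minimal sketch of that Python route using the client library's extract_table with an Avro destination format; it reuses the dataset and tabela names from your script and a placeholder GCS path, so adjust them to your setup:

from google.cloud import bigquery

bigquery_client = bigquery.Client()

# Placeholder destination path in the bucket; change it to where you want the export
destination_uri = 'gs://new_bucket/insert_transfer/table_register.avro'

job_config = bigquery.ExtractJobConfig()
job_config.destination_format = bigquery.DestinationFormat.AVRO

extract_job = bigquery_client.extract_table(
    bigquery_client.dataset(dataset).table(tabela),
    destination_uri,
    job_config=job_config
)
extract_job.result()  # wait for the export job to finish
print('Exported {}.{} to {}'.format(dataset, tabela, destination_uri))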
I hope this approach works for you.
