Add a date_loaded field when uploading a CSV to BigQuery - Python

Using Python, is there any way to add an extra field while loading a CSV file into BigQuery? I'd like to add a date_loaded field with the current date.
The Google code example I have used:
# from google.cloud import bigquery
# client = bigquery.Client()
# dataset_id = 'my_dataset'

dataset_ref = client.dataset(dataset_id)
job_config = bigquery.LoadJobConfig()
job_config.schema = [
    bigquery.SchemaField('name', 'STRING'),
    bigquery.SchemaField('post_abbr', 'STRING')
]
job_config.skip_leading_rows = 1
# The source format defaults to CSV, so the line below is optional.
job_config.source_format = bigquery.SourceFormat.CSV
uri = 'gs://cloud-samples-data/bigquery/us-states/us-states.csv'

load_job = client.load_table_from_uri(
    uri,
    dataset_ref.table('us_states'),
    job_config=job_config)  # API request
print('Starting job {}'.format(load_job.job_id))

load_job.result()  # Waits for table load to complete.
print('Job finished.')

destination_table = client.get_table(dataset_ref.table('us_states'))
print('Loaded {} rows.'.format(destination_table.num_rows))

By modifying this Python example to fit your issue, you can open and read the original CSV file from your local PC, edit it by adding a column, and append a timestamp at the end of each line so the new column is never empty. This link explains how to get a timestamp in Python with a custom date and time format.
Then you write the resulting data to an output file and load it into Google Storage. Here you can find information on how to run external commands from a Python file.
I hope this helps.
# Import the dependencies
import csv, datetime, subprocess
from google.cloud import bigquery

# Replace the values of these variables with the appropriate ones
# Name of the input csv file
csv_in_name = 'us-states.csv'
# Name of the output csv file, to avoid messing up the original
csv_out_name = 'out_file_us-states.csv'
# Name of the NEW COLUMN to be added
new_col_name = 'date_loaded'
# Type of the new column
col_type = 'DATETIME'
# Name of your bucket
bucket_id = 'YOUR BUCKET ID'
# Your dataset name
ds_id = 'YOUR DATASET ID'
# The destination table name
destination_table_name = 'TABLE NAME'

# Read and write csv files
with open(csv_in_name, 'r') as r_csvfile:
    with open(csv_out_name, 'w') as w_csvfile:
        dict_reader = csv.DictReader(r_csvfile, delimiter=',')
        # Add the new column to the existing ones
        fieldnames = dict_reader.fieldnames + [new_col_name]
        writer_csv = csv.DictWriter(w_csvfile, fieldnames, delimiter=',')
        writer_csv.writeheader()
        for row in dict_reader:
            # Put the timestamp after the last comma so that the column is not empty
            row[new_col_name] = datetime.datetime.now()
            writer_csv.writerow(row)

# Copy the file to your Google Storage bucket
subprocess.call('gsutil cp ' + csv_out_name + ' gs://' + bucket_id, shell=True)

client = bigquery.Client()
dataset_ref = client.dataset(ds_id)
job_config = bigquery.LoadJobConfig()
# Add the new column to the schema!
job_config.schema = [
    bigquery.SchemaField('name', 'STRING'),
    bigquery.SchemaField('post_abbr', 'STRING'),
    bigquery.SchemaField(new_col_name, col_type)
]
job_config.skip_leading_rows = 1
# The source format defaults to CSV, so the line below is optional.
job_config.source_format = bigquery.SourceFormat.CSV
# Address string of the output csv file
uri = 'gs://' + bucket_id + '/' + csv_out_name

load_job = client.load_table_from_uri(uri, dataset_ref.table(destination_table_name), job_config=job_config)  # API request
print('Starting job {}'.format(load_job.job_id))

load_job.result()  # Waits for table load to complete.
print('Job finished.')

destination_table = client.get_table(dataset_ref.table(destination_table_name))
print('Loaded {} rows.'.format(destination_table.num_rows))
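If you would rather not shell out to gsutil, the same upload can be done with the google-cloud-storage client library (a separate package from google-cloud-bigquery). A minimal sketch, reusing the csv_out_name and bucket_id variables from above:

from google.cloud import storage

storage_client = storage.Client()
bucket = storage_client.bucket(bucket_id)
# Upload the generated CSV to the bucket root, mirroring the gsutil cp call above
blob = bucket.blob(csv_out_name)
blob.upload_from_filename(csv_out_name)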

You can keep loading your data just as you are now, but into a table called old_table.
Once loaded, you can run something like:
bq --location=US query --destination_table mydataset.newtable --use_legacy_sql=false --replace=true 'select *, current_date() as date_loaded from mydataset.old_table'
This basically copies the content of the old table, with a new date_loaded column appended at the end, into the new table. This way you get the new column without downloading anything locally or any of the other mess.
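If you prefer to stay in Python rather than shelling out to the bq CLI, the equivalent query job can be run with the client library. A minimal sketch, where myproject.mydataset.newtable is a placeholder for your real destination table:

from google.cloud import bigquery

client = bigquery.Client()

# Write the query result into a new table, replacing it if it already exists
job_config = bigquery.QueryJobConfig(
    destination='myproject.mydataset.newtable',
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)

sql = 'SELECT *, CURRENT_DATE() AS date_loaded FROM `mydataset.old_table`'
query_job = client.query(sql, job_config=job_config)  # API request
query_job.result()  # Wait for the query to finish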

Related

AWS Glue create_partition using boto3 successful, but Athena not showing results for query

I have a Glue script to create new partitions using create_partition(). The Glue script runs successfully, and I can see the partitions in the Athena console when using SHOW PARTITIONS. For the Glue script's create_partitions I referred to this sample code: https://medium.com/@bv_subhash/demystifying-the-ways-of-creating-partitions-in-glue-catalog-on-partitioned-s3-data-for-faster-e25671e65574
When I try to run an Athena query for a given partition which was newly added, I get no results.
Is it that I need to trigger the MSCK command, even if I add the partitions using create_partitions? Appreciate any suggestions.
I have found the solution myself and wanted to share it with the SO community, so it may be useful to someone. The following code, when run as a Glue job, creates partitions that can also be queried in Athena for the new partition columns. Please change/add the parameter values (DB name, table name, partition columns) as needed.
import boto3
import sys

# Configure database / table name and emp_id, file_id (from workflow params?)
DATABASE_NAME = 'my_db'
TABLE_NAME = 'enter_table_name'
emp_id_tmp = ''
file_id_tmp = ''

# Initialise the Glue client using boto3
glue_client = boto3.client('glue')

# Get the current table schema for the given database name & table name
def get_current_schema(database_name, table_name):
    try:
        response = glue_client.get_table(
            DatabaseName=database_name,
            Name=table_name
        )
    except Exception:
        print("Exception while fetching table info")
        sys.exit(-1)

    # Parse the table info required to create partitions
    table_data = {}
    table_data['input_format'] = response['Table']['StorageDescriptor']['InputFormat']
    table_data['output_format'] = response['Table']['StorageDescriptor']['OutputFormat']
    table_data['table_location'] = response['Table']['StorageDescriptor']['Location']
    table_data['serde_info'] = response['Table']['StorageDescriptor']['SerdeInfo']
    table_data['partition_keys'] = response['Table']['PartitionKeys']
    return table_data

# Prepare the partition input list using table_data
def generate_partition_input_list(table_data):
    input_list = []  # Initializing empty list
    part_location = "{}/emp_id={}/file_id={}/".format(table_data['table_location'], emp_id_tmp, file_id_tmp)
    input_dict = {
        'Values': [
            emp_id_tmp, file_id_tmp
        ],
        'StorageDescriptor': {
            'Location': part_location,
            'InputFormat': table_data['input_format'],
            'OutputFormat': table_data['output_format'],
            'SerdeInfo': table_data['serde_info']
        }
    }
    input_list.append(input_dict.copy())
    return input_list

# Create the partition dynamically using the partition input list
table_data = get_current_schema(DATABASE_NAME, TABLE_NAME)
input_list = generate_partition_input_list(table_data)
try:
    create_partition_response = glue_client.batch_create_partition(
        DatabaseName=DATABASE_NAME,
        TableName=TABLE_NAME,
        PartitionInputList=input_list
    )
    print('Glue partition created successfully.')
    print(create_partition_response)
except Exception as e:
    # Handle the exception as per your business requirements
    print(e)
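If you want to sanity-check the new partition from code, you can fire an Athena query at it with boto3. A minimal sketch, where the partition values and the results bucket are hypothetical:

import boto3

athena_client = boto3.client('athena')

# Hypothetical query against the newly added partition
response = athena_client.start_query_execution(
    QueryString="SELECT COUNT(*) FROM enter_table_name WHERE emp_id = '123' AND file_id = '456'",
    QueryExecutionContext={'Database': 'my_db'},
    ResultConfiguration={'OutputLocation': 's3://my-athena-results-bucket/'}
)
print(response['QueryExecutionId'])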

BigQuery: how to change the mode of columns?

I have a Dataflow pipeline that fetches data from Pub/Sub, prepares it for insertion into BigQuery, and then writes it into the database.
It works fine: it generates the schema automatically and is able to recognise which datatype to use and so on.
However, the data we are using can vary vastly in format. For example, we can get both A and B for a single column:
A {"name":"John"}
B {"name":["Albert", "Einstein"]}
If the first message we receive gets added first, then adding the second one does not work.
If I do it the other way around, however, it does.
I always get the following error:
INFO:root:Error: 400 POST https://bigquery.googleapis.com/upload/bigquery/v2/project/projectname/jobs?uploadType=resumable: Provided Schema does not match Table project:test_dataset.test_table. Field cars has changed mode from NULLABLE to REPEATED with loading dataframe
ERROR:apache_beam.runners.direct.executor:Exception at bundle <apache_beam.runners.direct.bundle_factory._Bundle object at 0x7fcb9003f2c0>, due to an exception.
Traceback (most recent call last):
........
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
.....
Provided Schema does not match Table project.test_table. Field cars has changed mode from NULLABLE to REPEATED
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "apache_beam/runners/common.py", line 1233, in apache_beam.runners.common.DoFnRunner.process
File "apache_beam/runners/common.py", line 582, in apache_beam.runners.common.SimpleInvoker.invoke_process
File "newmain.py", line 211, in process
if load_job and load_job.errors:
UnboundLocalError: local variable 'load_job' referenced before assignment
Below is the code
class WriteDataframeToBQ(beam.DoFn):

    def __init__(self, bq_dataset, bq_table, project_id):
        self.bq_dataset = bq_dataset
        self.bq_table = bq_table
        self.project_id = project_id

    def start_bundle(self):
        self.client = bigquery.Client()

    def process(self, df):
        # Table where we're going to store the data
        table_id = f"{self.bq_dataset}.{self.bq_table}"

        # Function to help with the json -> bq schema transformations
        generator = SchemaGenerator(input_format='dict', quoted_values_are_strings=True, keep_nulls=True)

        # Get the original schema to assist the deduce_schema function.
        # If the table doesn't exist, proceed with an empty original_schema_map.
        try:
            table = self.client.get_table(table_id)
            original_schema = table.schema
            self.client.schema_to_json(original_schema, "original_schema.json")
            with open("original_schema.json") as f:
                original_schema = json.load(f)
            original_schema_map, original_schema_error_logs = generator.deduce_schema(input_data=original_schema)
        except Exception:
            logging.info(f"{table_id} table does not exist. Proceeding without the existing schema")
            original_schema_map = {}

        # Convert the dataframe to a dict
        json_text = df.to_dict('records')

        # Generate the new schema; it has to be written to a file because schema_from_json only accepts a json file as input
        schema_map, error_logs = generator.deduce_schema(input_data=json_text, schema_map=original_schema_map)
        schema = generator.flatten_schema(schema_map)

        schema_file_name = "schema_map.json"
        with open(schema_file_name, "w") as output_file:
            json.dump(schema, output_file)

        # Convert the generated schema to a version that BQ understands
        bq_schema = self.client.schema_from_json(schema_file_name)

        job_config = bigquery.LoadJobConfig(
            source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
            schema_update_options=[
                bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION,
                bigquery.SchemaUpdateOption.ALLOW_FIELD_RELAXATION
            ],
            write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
            schema=bq_schema
        )
        job_config.schema = bq_schema

        load_job = None  # initialise so the except branch can safely reference it
        try:
            load_job = self.client.load_table_from_json(
                json_text,
                table_id,
                job_config=job_config,
            )  # Make an API request.
            load_job.result()  # Waits for the job to complete.
            if load_job.errors:
                logging.info(f"error_result = {load_job.error_result}")
                logging.info(f"errors = {load_job.errors}")
            else:
                logging.info(f'Loaded {len(df)} rows.')
        except Exception as error:
            logging.info(f'Error: {error} with loading dataframe')
            if load_job and load_job.errors:
                logging.info(f"error_result = {load_job.error_result}")
                logging.info(f"errors = {load_job.errors}")


def run(argv):
    parser = argparse.ArgumentParser()
    known_args, pipeline_args = parser.parse_known_args(argv)
    pipeline_options = PipelineOptions(pipeline_args, save_main_session=True, streaming=True)
    options = pipeline_options.view_as(JobOptions)

    with beam.Pipeline(options=pipeline_options) as pipeline:
        (
            pipeline
            | "Read PubSub Messages" >> beam.io.ReadFromPubSub(subscription=options.input_subscription)
            | "Write Raw Data to Big Query" >> beam.ParDo(WriteDataframeToBQ(project_id=options.project_id, bq_dataset=options.bigquery_dataset, bq_table=options.bigquery_table))
        )


if __name__ == "__main__":
    logging.getLogger().setLevel(logging.INFO)
    run(sys.argv)
Is there a way to change the restrictions of the table to make this work?
BigQuery isn't a document database but a column-oriented database. In addition, you can't update the schema of existing columns (only add or remove columns).
For your use case, and because you can't know/predict the most generic schema for each of your fields, the safest option is to store the raw JSON as a string, and then to use BigQuery's JSON functions to post-process your data in SQL.
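As a sketch of that suggestion (the my_dataset.raw_events table name and raw_json column are assumptions): load each record into a single STRING column, then extract fields with SQL at query time.

from google.cloud import bigquery

client = bigquery.Client()

# One permissive column: the raw message kept as a string
job_config = bigquery.LoadJobConfig(
    schema=[bigquery.SchemaField('raw_json', 'STRING')]
)
rows = [{'raw_json': '{"name": "John"}'},
        {'raw_json': '{"name": ["Albert", "Einstein"]}'}]
client.load_table_from_json(rows, 'my_dataset.raw_events', job_config=job_config).result()

# JSON_EXTRACT returns the JSON at the path, whether it is a scalar or an array
sql = "SELECT JSON_EXTRACT(raw_json, '$.name') AS name FROM `my_dataset.raw_events`"
for row in client.query(sql).result():
    print(row.name)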

Issue while creating a job_config parameter dynamically in BigQuery

I am trying to load data via bq_client.load_table_from_uri, and I am getting the error below; I'm not sure why.
I am expecting job_config to be:
job_config = bigquery.LoadJobConfig(
    schema=[
        bigquery.SchemaField("COL1", "STRING"),
        bigquery.SchemaField("COL2", "STRING"),
        bigquery.SchemaField("COL3", "INTEGER"),
    ],
    skip_leading_rows=1,
    source_format=bigquery.SourceFormat.CSV,
    field_delimiter='|',
    max_bad_records=1
)
When I hardcode the job_config it works fine, but when I try to generate it dynamically it gives me an error.
variable1 & variable2 are static; the bigquery_final_temp variable will be generated dynamically by some other process.
Note: when job_config is hardcoded its datatype is "LoadJobConfig", but when it is dynamically constructed its datatype is str. How can I convert this str datatype to "LoadJobConfig"?
Below are the file data, code, and error.
File data:
header|1234|2020-10-10
abc|xyz|12345
aws|gcp|1234
Code:
uri = "gs://abc/sftp/test.csv"
var_delimitor = '|'

variable1 = 'bigquery.LoadJobConfig( schema=['
variable2 = '],skip_leading_rows=1,source_format=bigquery.SourceFormat.CSV,field_delimiter=' + var_delimitor + ',max_bad_records=1 )'
bigquery_final_temp = 'bigquery.SchemaField("COL1","STRING"),bigquery.SchemaField("COL2","STRING"),bigquery.SchemaField("COL3","INTEGER"),'

job_config_2 = variable1 + bigquery_final_temp
job_config_final = job_config_2 + variable2

########### load data
load_job = bq_client.load_table_from_uri(
    uri, Table_name, job_config=job_config_final
)  # Make an API request.
print('Lots of content here 5')
load_job.result()  # Wait for the job to complete.
print('Lots of content here 6')

table = bq_client.get_table(Table_name)
print("Loaded {} rows to table {}".format(table.num_rows, Table_name))
error:
TypeError: Expected an instance of LoadJobConfig class for the job_config parameter,
but received job_config = bigquery.LoadJobConfig( schema=[bigquery.SchemaField("COL1","STRING"),bigquery.SchemaField("COL2","STRING"),
bigquery.SchemaField("COL3","INTEGER"),],skip_leading_rows=1,source_format=bigquery.SourceFormat.CSV,field_delimiter='|',max_bad_records=1)

(python) Iterating through a list of Salesforce tables to extract and load into AWS S3

Good Morning All!
I'm trying to have a routine iterate through a table list. The code below works on a single table, 'contact'. I want to iterate through all of the tables listed in my tablelist.csv. I bolded the parts below which would need to be dynamically modified in the code. My brain is pretty fried at this point from working through two nights, and I'm fully prepared for the internet to tell me that this is in chapter two of Intro to Python, but I could use the help just to get over this hurdle.
import pandas as pd
import boto3
from simple_salesforce import Salesforce

li = pd.read_csv('tablelist.csv', header=None)

desc = sf.**Contact**.describe()
field_names = [field['name'] for field in desc['fields']]
soql = "SELECT {} FROM **Contact**".format(','.join(field_names))
results = sf.query_all(soql)

sf_df = pd.DataFrame(results['records']).drop(columns='attributes')
sf_df.to_csv('**contact**.csv')

s3 = boto3.client('s3')
s3.upload_file('contact.csv', 'mybucket', 'Ops/20201027/contact.csv')
It would help if you could provide a sample of the tablelist file, but here's a stab at it. You really just need to get the list of tables and loop through it.
# Assuming the table name is a column somewhere in the file
df_tablelist = pd.read_csv('tablelist.csv')

for Contact in df_tablelist['yourtablecolumntoiterateon'].tolist():
    desc = getattr(sf, Contact).describe()  # dynamic equivalent of sf.Contact.describe()
    field_names = [field['name'] for field in desc['fields']]
    soql = "SELECT {} FROM {}".format(','.join(field_names), Contact)
    results = sf.query_all(soql)
    sf_df = pd.DataFrame(results['records']).drop(columns='attributes')
    sf_df.to_csv(Contact + '.csv')
    s3 = boto3.client('s3')
    s3.upload_file(Contact + '.csv', 'mybucket', 'Ops/20201027/' + Contact + '.csv')
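For reference, a hypothetical tablelist.csv that would work with the loop above (keeping the placeholder column header) could look like:

yourtablecolumntoiterateon
Contact
Account
Opportunity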

How do I create a CSV in Lambda using Python?

I would like to create a report in Lambda using Python that is saved in a CSV file. Below is the code of the function:
import boto3
import datetime
import re

def lambda_handler(event, context):
    client = boto3.client('ce')
    now = datetime.datetime.utcnow()
    end = datetime.datetime(year=now.year, month=now.month, day=1)
    start = end - datetime.timedelta(days=1)
    start = datetime.datetime(year=start.year, month=start.month, day=1)
    start = start.strftime('%Y-%m-%d')
    end = end.strftime('%Y-%m-%d')

    response = client.get_cost_and_usage(
        TimePeriod={
            'Start': "2019-02-01",
            'End': "2019-08-01"
        },
        Granularity='MONTHLY',
        Metrics=['BlendedCost'],
        GroupBy=[
            {
                'Type': 'TAG',
                'Key': 'Project'
            },
        ]
    )
How can I create a CSV file from it?
Here is a sample function to create a CSV file in Lambda using Python:
Assuming that the variable 'response' has the required data for creating the report for you, the following piece of code will help you create a temporary CSV file in the /tmp folder of the lambda function:
import csv

temp_csv_file = csv.writer(open("/tmp/csv_file.csv", "w+"))
# Writing the column names
temp_csv_file.writerow(["Account Name", "Month", "Cost"])

# Writing rows into the CSV file
for detail in response:
    temp_csv_file.writerow([detail['account_name'],
                            detail['month'],
                            detail['cost']
                            ])
Once you have created the CSV file, you can upload it to S3 and send it as an email attachment or share it as a link using the following piece of code:
client = boto3.client('s3')
client.upload_file('/tmp/csv_file.csv', BUCKET_NAME,'final_report.csv')
Points to remember:
/tmp is a directory with 512 MB of storage which can be used for a few in-memory/temporary files.
You should not rely on this storage to maintain state across subsequent Lambda invocations.
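If you want the "share it as a link" part, one common option is a presigned URL. A minimal sketch, reusing BUCKET_NAME and the key from the upload above:

client = boto3.client('s3')
# Generate a time-limited download link for the uploaded report
url = client.generate_presigned_url(
    'get_object',
    Params={'Bucket': BUCKET_NAME, 'Key': 'final_report.csv'},
    ExpiresIn=3600  # the link stays valid for one hour
)
print(url)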
The above answer by Repakula Srushith is correct but will create an empty CSV, as the file is not being closed. You can change the code to:
f = open("/tmp/csv_file.csv", "w+")
temp_csv_file = csv.writer(f)
temp_csv_file.writerow(["Account Name", "Month", "Cost"])

# Writing rows into the CSV file
for detail in response:
    temp_csv_file.writerow([detail['account_name'],
                            detail['month'],
                            detail['cost']
                            ])
f.close()
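Alternatively, a with block achieves the same thing and closes the file even if a row write raises partway through, which is the more idiomatic fix (same assumed response structure as above):

import csv

with open("/tmp/csv_file.csv", "w+") as f:
    temp_csv_file = csv.writer(f)
    temp_csv_file.writerow(["Account Name", "Month", "Cost"])
    # Writing rows into the CSV file
    for detail in response:
        temp_csv_file.writerow([detail['account_name'],
                                detail['month'],
                                detail['cost']])
# The file is closed automatically when the block exits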
