I am trying to load data via bq_client.load_table_from_uri, and I am getting the error below; I am not sure why.
I expect the job_config to look like this:
job_config = bigquery.LoadJobConfig(
    schema=[
        bigquery.SchemaField("COL1", "STRING"),
        bigquery.SchemaField("COL2", "STRING"),
        bigquery.SchemaField("COL3", "INTEGER"),
    ],
    skip_leading_rows=1,
    source_format=bigquery.SourceFormat.CSV,
    field_delimiter='|',
    max_bad_records=1,
)
When I hardcode the job_config it works fine, but when I try to generate it dynamically it gives me an error.
variable1 and variable2 are static; the bigquery_final_temp variable will be generated dynamically by some other process.
Note: when job_config is hardcoded its datatype is "LoadJobConfig", but when it is constructed dynamically its datatype is str. How can I convert this str to a "LoadJobConfig"?
Below are the file data, the code, and the error.
file data :
header|1234|2020-10-10
abc|xyz|12345
aws|gcp|1234
code:
uri = "gs://abc/sftp/test.csv"
var_delimitor = '|'
variable1 = 'bigquery.LoadJobConfig( schema=['
variable2 = '],skip_leading_rows=1,source_format=bigquery.SourceFormat.CSV,field_delimiter='+var_delimitor+',max_bad_records=1 )'
bigquery_final_temp = 'bigquery.SchemaField("COL1","STRING"),bigquery.SchemaField("COL2","STRING"),bigquery.SchemaField("COL3","INTEGER"),'
job_config_2=variable1 + bigquery_final_temp
job_config_final=job_config_2 + variable2
###########load data
load_job = bq_client.load_table_from_uri(
uri,Table_name, job_config=job_config_final
) # Make an API request.
print('Lots of content here 5')
load_job.result() # Wait for the job to complete.
print('Lots of content here 6')
table = bq_client.get_table(Table_name)
print("Loaded {} rows to table {}".format(table.num_rows, Table_name))
error:
TypeError: Expected an instance of LoadJobConfig class for the job_config parameter,
but received job_config = bigquery.LoadJobConfig( schema=[bigquery.SchemaField("COL1","STRING"),bigquery.SchemaField("COL2","STRING"),
bigquery.SchemaField("COL3","INTEGER"),],skip_leading_rows=1,source_format=bigquery.SourceFormat.CSV,field_delimiter='|',max_bad_records=1)
I wish to create an API that takes a pandas DataFrame as input and stores it in my DB.
I am able to do so with a CSV file. The problem with that, however, is that the datatype information is lost (column datatypes like int, array, float and so on), which is important for what I am trying to do.
I have already read this: Passing a pandas dataframe to FastAPI for NLP ML
I cannot create a class like this:
class Data(BaseModel):
# id: str
project: str
messages: str
The reason is that I don't have any fixed schema: the DataFrame could be of any shape, with varying data types. I have created a dynamic query that creates a table matching the incoming DataFrame and inserts the DataFrame into it as well.
However, being new to FastAPI, I am not able to figure out whether there is an efficient way to send this changing (dynamic) DataFrame and store it via the queries that I have created.
If the information is not sufficient, I can try to provide more examples.
Is there a way I can send the pandas DataFrame from my Jupyter notebook itself?
Any guidance on this would be greatly appreciated.
#router.post("/send-df")
async def push_df_funct(
target_name: Optional[str] = Form(...),
join_key: str = Form(...),
local_csv_file: UploadFile = File(None),
db: Session = Depends(pg.get_db)
):
"""
API to upload dataframe to database
"""
return upload_dataframe(db, featureset_name, local_csv_file, join_key)
def registration_cassandra(self, feature_registation_dict):
'''
# Table creation in cassandra as per the given feature registration JSON
Takes:
1. feature_registration_dict: Feature registration JSON
Returns:
- Response stating that the table has been created in cassandra
'''
logging.info(feature_registation_dict)
target_table_name = feature_registation_dict.get('featureset_name')
join_key = feature_registation_dict.get('join_key')
metadata_list = feature_registation_dict.get('metadata_list')
table_name_delimiter = "__"
logging.info(metadata_list)
column_names = [ sub['name'] for sub in metadata_list ]
data_types = [ DataType.to_cass_datatype(eval(sub['data_type']).value) for sub in metadata_list ]
logging.info(f"Column names: {column_names}")
logging.info(f"Data types: {data_types}")
ls = list(zip(column_names, data_types))
target_table_name = target_table_name + table_name_delimiter + join_key
base_query = f"CREATE TABLE {self.keyspace}.{target_table_name} ("
# CREATE TABLE images_by_month5(tid object PRIMARY KEY , cc_num object,amount object,fraud_label object,activity_time object,month object);
# create_query_new = "CREATE TABLE vpinference_dev.images_by_month4 (month int,activity_time timestamp,amount double,cc_num varint,fraud_label varint,
# tid text,PRIMARY KEY (month, activity_time, tid)) WITH CLUSTERING ORDER BY (activity_time DESC, tid ASC)"
#CREATE TABLE group_join_dates ( groupname text, joined timeuuid, username text, email text, age int, PRIMARY KEY (groupname, joined) )
flag = True
for name, data_type in ls:
base_query += " " + name
base_query += " " + data_type
#if flag :
# base_query += " PRIMARY KEY "
# flag = False
base_query += ','
create_query = base_query.strip(',').rstrip(' ') + ', month varchar, activity_time timestamp,' + ' PRIMARY KEY (' + f'month, activity_time, {join_key}) )' + f' WITH CLUSTERING ORDER BY (activity_time DESC, {join_key} ASC' + ');'
logging.info(f"Query to create table in cassandra: {create_query}")
try:
session = self.get_session()
session.execute((create_query))
except Exception as e:
logging.exception(f"Some error occurred while doing the registration in cassandra. Details :: {str(e)}")
raise AppException(f"Some error occurred while doing the registration in cassandra. Details :: {str(e)}")
response = f"Table created successfully in cassandra at: vpinference_dev.{target_table_name}__{join_key};"
return response
This is the dictionary that I am passing:
feature_registation_dict = {
'featureSetName': 'data_type_testing_29',
'teamName': 'Harsh',
'frequency': 'DAILY',
'joinKey': 'tid',
'model_version': 'v1',
'model_name': 'data type testing',
'metadata_list': [{'name': 'tid',
'data_type': 'text',
'definition': 'Credit Card Number (Unique)'},
{'name': 'cc_num',
'data_type': 'bigint',
'definition': 'Aggregated Metric: Average number of transactions for the card aggregated by past 10 minutes'},
{'name': 'amount',
'data_type': 'double',
'definition': 'Aggregated Metric: Average transaction amount for the card aggregated by past 10 minutes'},
{'name': 'datetime',
'data_type': 'text',
'definition': 'Required feature for event timestamp'}]}
Not sure I understood exactly what you need, but I'll give it a try. To send any DataFrame to FastAPI, you could do something like:
# fastapi
@app.post("/receive_df")
def receive_df(df_in: str):
    df = pd.read_json(df_in)

# jupyter
payload = {"df_in": df.to_json()}
requests.post("http://localhost:8000/receive_df", data=payload)
Can't really test this right now, and there are probably some mistakes in there, but the gist is just serializing the DataFrame to JSON and then deserializing it in the endpoint. If you need (JSON) validation, you can also use the pydantic.Json data type. If there is no fixed schema then you can't use BaseModel in any useful way, but just sending a plain JSON string should be all you need, as long as your data comes only from reliable sources (your Jupyter notebook).
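Here is a sketch of the same idea that also keeps column dtypes, assuming a FastAPI app named app and an endpoint path of /receive_df (both hypothetical names). pandas' orient="table" format embeds a JSON Table Schema alongside the data, so the dtypes survive the round trip:
import io

import pandas as pd
from fastapi import Body, FastAPI

app = FastAPI()

@app.post("/receive_df")
def receive_df(df_in: str = Body(..., embed=True)):
    # orient="table" round-trips dtypes (int, float, datetime, ...) via the embedded schema
    df = pd.read_json(io.StringIO(df_in), orient="table")
    return {"rows": len(df), "dtypes": {col: str(t) for col, t in df.dtypes.items()}}

# From the Jupyter notebook:
#   import requests
#   requests.post("http://localhost:8000/receive_df", json={"df_in": df.to_json(orient="table")})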
I currently have a piece of code which I'm trying to get into Airflow.
The code takes data from a table, applies the function in def map_manufacturer_model, and outputs the result into a BigQuery table.
I know the code works on my PC, but when I move it to Airflow I get the error:
TypeError: missing 1 required positional argument: 's'
Can anyone offer any guidance on this as I'm pretty stuck now!
Thanks
import pandas as pd
import datetime
from airflow.operators import python_operator
from google.cloud import bigquery
from airflow import models
client = bigquery.Client()
bqclient = bigquery.Client()
# Output table for dataframe
table_id = "Staging"
# Dataframe Code
query_string = """
SELECT * FROM `Staging`
"""
gas_data = (
bqclient.query(query_string)
.result()
.to_dataframe(
create_bqstorage_client=True,
))
manufacturers = {'G4F0': 'FLN', 'G4F1': 'FLN', 'G4F9': 'FLN', 'G4K0': 'HWL', 'E6S1': 'LPG', 'E6S2': 'LPG'}
meter_models = {'G4F0': {'1': 'G4SZV-1', '2': 'G4SZV-2'},
'G4F9': {'': 'G4SZV-1'},
'G4F1': {'': 'G4SDZV-2'},
'G4K0': {'': 'BK-G4E'},
'E6S1': {'': 'E6VG470'},
'E6S2': {'': 'E6VG470'},
}
def map_manufacturer_model(s):
s = str(s)
model = ''
try:
manufacturer = manufacturers[s[:4]]
for k, m in meter_models[s[:4]].items():
if s[-4:].startswith(k):
model = m
break
except KeyError:
manufacturer = ''
return pd.Series({'NewMeterManufacturer': manufacturer,
'NewMeterModel': model
})
gas_data[['NewMeterManufacturer', 'NewMeterModel']] = gas_data['NewSerialNumber'].apply(map_manufacturer_model)
job_config = bigquery.LoadJobConfig(
# Specify a (partial) schema. All columns are always written to the
# table. The schema is used to assist in data type definitions.
schema=[
],
write_disposition="WRITE_TRUNCATE",
)
job = client.load_table_from_dataframe(
gas_data, table_id, job_config=job_config
) # Make an API request.
job.result() # Wait for the job to complete.
table = client.get_table(table_id) # Make an API request.
print(
"Loaded {} rows and {} columns to {}".format(
table.num_rows, len(table.schema), table_id
)
)
print('Loaded DATAFRAME into BQ TABLE')
default_dag_args = {'owner': 'Owner',
'start_date': datetime.datetime(2021, 12, 15),
'retries': 1,
}
with models.DAG('test_dag',
schedule_interval='0 8 * * *',
default_args=default_dag_args) as dag:
map_manufacturer_model_function = python_operator.PythonOperator(
task_id='map_manufacturer_model_function',
python_callable=map_manufacturer_model
)
When you actually call the PythonOperator you are not passing in the s argument.
You can add op_kwargs to the arguments passed as part of the call, as below:
map_manufacturer_model_function = python_operator.PythonOperator(
task_id='map_manufacturer_model_function',
python_callable=map_manufacturer_model,
op_kwargs={'s':'stringValue'}
)
See the python operator example here:
https://airflow.apache.org/docs/apache-airflow/stable/howto/operator/python.html
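Since map_manufacturer_model is applied row by row inside the load script rather than being a standalone task, another option (a sketch, assuming the module-level code from the question stays as it is) is to wrap the whole query/transform/load sequence in a zero-argument function and hand that to the operator:
def run_gas_etl():
    # Re-runs the steps from the script above inside the task:
    # query -> add manufacturer/model columns -> load to BigQuery.
    gas_data = (
        bqclient.query(query_string)
        .result()
        .to_dataframe(create_bqstorage_client=True)
    )
    gas_data[['NewMeterManufacturer', 'NewMeterModel']] = (
        gas_data['NewSerialNumber'].apply(map_manufacturer_model)
    )
    client.load_table_from_dataframe(gas_data, table_id, job_config=job_config).result()

map_manufacturer_model_function = python_operator.PythonOperator(
    task_id='map_manufacturer_model_function',
    python_callable=run_gas_etl,
)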
I have a Dataflow pipeline that fetches data from Pub/Sub, prepares it for insertion into BigQuery, and then writes it into the database.
It works fine: it can generate the schema automatically and it is able to recognise what datatype to use and so on.
However, the data we are using can vary vastly in format. For example, we can get both A and B for a single column:
A {"name":"John"}
B {"name":["Albert", "Einstein"]}
If the first message we get is added first, then adding the second one will not work.
If I do it the other way around, however, it does.
I always get the following error:
INFO:root:Error: 400 POST https://bigquery.googleapis.com/upload/bigquery/v2/project/projectname/jobs?uploadType=resumable: Provided Schema does not match Table project:test_dataset.test_table. Field cars has changed mode from NULLABLE to REPEATED with loading dataframe
ERROR:apache_beam.runners.direct.executor:Exception at bundle <apache_beam.runners.direct.bundle_factory._Bundle object at 0x7fcb9003f2c0>, due to an exception.
Traceback (most recent call last):
........
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
.....
Provided Schema does not match Table project.test_table. Field cars has changed mode from NULLABLE to REPEATED
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "apache_beam/runners/common.py", line 1233, in apache_beam.runners.common.DoFnRunner.process
File "apache_beam/runners/common.py", line 582, in apache_beam.runners.common.SimpleInvoker.invoke_process
File "newmain.py", line 211, in process
if load_job and load_job.errors:
UnboundLocalError: local variable 'load_job' referenced before assignment
Below is the code
class WriteDataframeToBQ(beam.DoFn):
def __init__(self, bq_dataset, bq_table, project_id):
self.bq_dataset = bq_dataset
self.bq_table = bq_table
self.project_id = project_id
def start_bundle(self):
self.client = bigquery.Client()
def process(self, df):
# table where we're going to store the data
table_id = f"{self.bq_dataset}.{self.bq_table}"
# function to help with the json -> bq schema transformations
generator = SchemaGenerator(input_format='dict', quoted_values_are_strings=True, keep_nulls=True)
# Get original schema to assist the deduce_schema function. If the table doesn't exist
# proceed with empty original_schema_map
try:
table = self.client.get_table(table_id)
original_schema = table.schema
self.client.schema_to_json(original_schema, "original_schema.json")
with open("original_schema.json") as f:
original_schema = json.load(f)
original_schema_map, original_schema_error_logs = generator.deduce_schema(input_data=original_schema)
except Exception:
logging.info(f"{table_id} table not exists. Proceed without getting schema")
original_schema_map = {}
# convert dataframe to dict
json_text = df.to_dict('records')
# generate the new schema, we need to write it to a file because schema_from_json only accepts json file as input
schema_map, error_logs = generator.deduce_schema(input_data=json_text, schema_map=original_schema_map)
schema = generator.flatten_schema(schema_map)
schema_file_name = "schema_map.json"
with open(schema_file_name, "w") as output_file:
json.dump(schema, output_file)
# convert the generated schema to a version that BQ understands
bq_schema = self.client.schema_from_json(schema_file_name)
job_config = bigquery.LoadJobConfig(
source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
schema_update_options=[
bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION,
bigquery.SchemaUpdateOption.ALLOW_FIELD_RELAXATION
],
write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
schema=bq_schema
)
job_config.schema = bq_schema
try:
load_job = self.client.load_table_from_json(
json_text,
table_id,
job_config=job_config,
) # Make an API request.
load_job.result() # Waits for the job to complete.
if load_job.errors:
logging.info(f"error_result = {load_job.error_result}")
logging.info(f"errors = {load_job.errors}")
else:
logging.info(f'Loaded {len(df)} rows.')
except Exception as error:
logging.info(f'Error: {error} with loading dataframe')
if load_job and load_job.errors:
logging.info(f"error_result = {load_job.error_result}")
logging.info(f"errors = {load_job.errors}")
def run(argv):
parser = argparse.ArgumentParser()
known_args, pipeline_args = parser.parse_known_args(argv)
pipeline_options = PipelineOptions(pipeline_args, save_main_session=True, streaming=True)
options = pipeline_options.view_as(JobOptions)
with beam.Pipeline(options=pipeline_options) as pipeline:
(
pipeline
| "Read PubSub Messages" >> beam.io.ReadFromPubSub(subscription=options.input_subscription)
| "Write Raw Data to Big Query" >> beam.ParDo(WriteDataframeToBQ(project_id=options.project_id, bq_dataset=options.bigquery_dataset, bq_table=options.bigquery_table))
)
if __name__ == "__main__":
logging.getLogger().setLevel(logging.INFO)
run(sys.argv)
Is there a way to change the restrictions of the table to make this work?
BigQuery isn't a document database, but a column-oriented database. In addition, you can't update the schema of existing columns (you can only add or remove columns).
For your use case, and because you can't know or predict the most generic schema for each of your fields, the safer option is to store the raw JSON as a string, and then use BigQuery's JSON functions to post-process your data in SQL.
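A minimal sketch of that approach, assuming a table my_dataset.raw_events with a single STRING column named raw_json (hypothetical names). Both shapes of the name field load fine because the column is just a string, and BigQuery's JSON functions pull the values back out at query time:
import json
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my_dataset.raw_events"

rows = [
    {"raw_json": json.dumps({"name": "John"})},
    {"raw_json": json.dumps({"name": ["Albert", "Einstein"]})},
]

client.load_table_from_json(
    rows,
    table_id,
    job_config=bigquery.LoadJobConfig(
        schema=[bigquery.SchemaField("raw_json", "STRING")],
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    ),
).result()

# Post-process in SQL: the scalar and array variants each come back where they apply.
query = f"""
    SELECT JSON_EXTRACT_SCALAR(raw_json, '$.name') AS name,
           JSON_EXTRACT_ARRAY(raw_json, '$.name')  AS names
    FROM `{table_id}`
"""
for row in client.query(query).result():
    print(row.name, row.names)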
I'm trying to pass a BigQuery table name as a value provider for an Apache Beam pipeline template. According to their documentation and this StackOverflow answer, it's possible to pass a value provider to apache_beam.io.gcp.bigquery.ReadFromBigQuery.
So this is the code for my pipeline:
class UserOptions(PipelineOptions):
    """Define runtime arguments"""
    @classmethod
    def _add_argparse_args(cls, parser):
        parser.add_value_provider_argument('--input', type=str)
        parser.add_value_provider_argument('--output', type=str)

pipeline_options = PipelineOptions()
p = beam.Pipeline(options=pipeline_options)
user_options = pipeline_options.view_as(UserOptions)

(p | 'Read from BQ Table' >> beam.io.gcp.bigquery.ReadFromBigQuery(
    user_options.input
))
When I run the code locally, I pass the value for user_options.input on the command line as --input projectid.dataset_id.table
However, I get the error:
ValueError: A BigQuery table or a query must be specified
I tried:
Passing projectid:dataset_id.table
Using bigquery.TableReference -> not possible
Using f'{user_options.input}'
Passing a query -> this works when run locally but does not work when I call the template on GCP. Error statement:
missing dataset while no default dataset is set in the request.", "errors": [ { "message": "Table name "RuntimeValueProvider(option: input, type: str, default_value: None)" missing dataset while no default dataset is set in the request.", "domain": "global", "reason": "invalid" } ], "status": "INVALID_ARGUMENT" } } >
What am I missing?
The table argument must be passed by name to ReadFromBigQuery.
BigQuerySource (deprecated) accepts a table as the first argument so you can pass one in by position (docs). But ReadFromBigQuery expects the gcs_location as the first argument (docs). So if you are porting code from using BigQuerySource to using ReadFromBigQuery and you weren't explicitly passing the table in by name, it will fail with the error you received.
Here are two working examples and one that does not work:
import apache_beam as beam
project_id = 'my_project'
dataset_id = 'my_dataset'
table_id = 'my_table'
if __name__ == "__main__":
args = [
'--temp_location=gs://my_temp_bucket',
]
# This works:
with beam.Pipeline(argv=args) as pipeline:
query_results = (
pipeline
| 'Read from BigQuery'
>> beam.io.ReadFromBigQuery(table=f"{project_id}:{dataset_id}.{table_id}")
)
# So does this:
with beam.Pipeline(argv=args) as pipeline:
query_results = (
pipeline
| 'Read from BigQuery'
>> beam.io.ReadFromBigQuery(table=f"{dataset_id}.{table_id}", project=project_id)
)
# But this doesn't work because the table argument is not passed in by name.
# The f"{project_id}:{dataset_id}.{table_id}" string is interpreted as the gcs_location.
with beam.Pipeline(argv=args) as pipeline:
query_results = (
pipeline
| 'Read from BigQuery'
>> beam.io.ReadFromBigQuery(f"{project_id}:{dataset_id}.{table_id}")
)
Using Python, is there any way to add an extra field while loading a CSV file into BigQuery? I'd like to add a date_loaded field with the current date.
This is the Google code example I have used:
# from google.cloud import bigquery
# client = bigquery.Client()
# dataset_id = 'my_dataset'
dataset_ref = client.dataset(dataset_id)
job_config = bigquery.LoadJobConfig()
job_config.schema = [
bigquery.SchemaField('name', 'STRING'),
bigquery.SchemaField('post_abbr', 'STRING')
]
job_config.skip_leading_rows = 1
# The source format defaults to CSV, so the line below is optional.
job_config.source_format = bigquery.SourceFormat.CSV
uri = 'gs://cloud-samples-data/bigquery/us-states/us-states.csv'
load_job = client.load_table_from_uri(
uri,
dataset_ref.table('us_states'),
job_config=job_config) # API request
print('Starting job {}'.format(load_job.job_id))
load_job.result() # Waits for table load to complete.
print('Job finished.')
destination_table = client.get_table(dataset_ref.table('us_states'))
print('Loaded {} rows.'.format(destination_table.num_rows))
By modifying the Python example above to fit your issue, you open and read the original CSV file from your local machine, edit it by adding a column, and append the current timestamp (obtained with datetime) to the end of each line so that the new column is not empty.
Then you write the resulting data to an output file, copy it to Google Cloud Storage by running gsutil from Python with subprocess, and load it into BigQuery.
I hope this helps.
#Import the dependencies
import csv,datetime,subprocess
from google.cloud import bigquery
#Replace the values for variables with the appropriate ones
#Name of the input csv file
csv_in_name = 'us-states.csv'
#Name of the output csv file to avoid messing up the original
csv_out_name = 'out_file_us-states.csv'
#Name of the NEW COLUMN NAME to be added
new_col_name = 'date_loaded'
#Type of the new column
col_type = 'DATETIME'
#Name of your bucket
bucket_id = 'YOUR BUCKET ID'
#Your dataset name
ds_id = 'YOUR DATASET ID'
#The destination table name
destination_table_name = 'TABLE NAME'
# read and write csv files
with open(csv_in_name,'r') as r_csvfile:
with open(csv_out_name,'w') as w_csvfile:
dict_reader = csv.DictReader(r_csvfile,delimiter=',')
#add new column with existing
fieldnames = dict_reader.fieldnames + [new_col_name]
writer_csv = csv.DictWriter(w_csvfile,fieldnames,delimiter=',')
writer_csv.writeheader()
for row in dict_reader:
#Put the timestamp after the last comma so that the column is not empty
row[new_col_name] = datetime.datetime.now()
writer_csv.writerow(row)
#Copy the file to your Google Storage bucket
subprocess.call('gsutil cp ' + csv_out_name + ' gs://' + bucket_id , shell=True)
client = bigquery.Client()
dataset_ref = client.dataset(ds_id)
job_config = bigquery.LoadJobConfig()
#Add a new column to the schema!
job_config.schema = [
bigquery.SchemaField('name', 'STRING'),
bigquery.SchemaField('post_abbr', 'STRING'),
bigquery.SchemaField(new_col_name, col_type)
]
job_config.skip_leading_rows = 1
# The source format defaults to CSV, so the line below is optional.
job_config.source_format = bigquery.SourceFormat.CSV
#Address string of the output csv file
uri = 'gs://' + bucket_id + '/' + csv_out_name
load_job = client.load_table_from_uri(uri,dataset_ref.table(destination_table_name),job_config=job_config) # API request
print('Starting job {}'.format(load_job.job_id))
load_job.result() # Waits for table load to complete.
print('Job finished.')
destination_table = client.get_table(dataset_ref.table(destination_table_name))
print('Loaded {} rows.'.format(destination_table.num_rows))
You can keep loading your data the way you are loading it now, but into a table called old_table.
Once loaded, you can run something like:
bq --location=US query --destination_table mydataset.newtable --use_legacy_sql=false --replace=true 'select *, current_date() as date_loaded from mydataset.old_table'
This basically writes the content of old_table, with a new date_loaded column appended at the end, into newtable. This way you get the new column without downloading anything locally or any of the mess.
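The same idea with the Python client instead of the bq CLI, as a sketch (the project, dataset, and table names here are placeholders):
from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.QueryJobConfig(
    destination="myproject.mydataset.newtable",
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)
sql = "SELECT *, CURRENT_DATE() AS date_loaded FROM `myproject.mydataset.old_table`"
client.query(sql, job_config=job_config).result()  # waits for the rewrite to finish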