Skip forbidden rows from a BigQuery query, using Python

I need to download a relatively small table from BigQuery and store it (after some parsing) in a Pandas DataFrame.
Here is the relevant sample of my code:
import pandas as pd
from google.cloud import bigquery

client = bigquery.Client(project="project_id")
job_config = bigquery.QueryJobConfig(allow_large_results=True)
query_job = client.query("my sql string", job_config=job_config)
result = query_job.result()
rows = [dict(row) for row in result]
pdf = pd.DataFrame.from_dict(rows)
My problem:
After a few thousand rows are parsed, one of them is too big and I get an exception: google.api_core.exceptions.Forbidden.
So, after a few iterations, I tried to transform my loop to something that looks like:
rows = list()
for _ in range(result.total_rows):
    try:
        rows.append(dict(next(result)))
    except google.api_core.exceptions.Forbidden:
        pass
BUT it doesn't work, since result is a bigquery.table.RowIterator and, despite its name, it's not an iterator... it's an iterable.
So... what do I do now? Is there a way to either:
ask for the next row in a try/except scope?
tell bigquery to skip bad rows?
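For illustration, a minimal sketch of the kind of loop being asked for, assuming the iterable can simply be wrapped with iter() (note that Forbidden is typically raised while a whole page of results is being fetched, so catching it per row may not actually skip a single bad row):

import google.api_core.exceptions

it = iter(result)  # `result` as above; RowIterator is iterable, so iter() gives a plain iterator
rows = []
while True:
    try:
        rows.append(dict(next(it)))
    except StopIteration:
        break
    except google.api_core.exceptions.Forbidden:
        continue  # may not help if the error occurs while fetching a whole page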

Did you try paging through query results?
from google.cloud import bigquery
# Construct a BigQuery client object.
client = bigquery.Client()
query = """
SELECT name, SUM(number) as total_people
FROM `bigquery-public-data.usa_names.usa_1910_2013`
GROUP BY name
ORDER BY total_people DESC
"""
query_job = client.query(query) # Make an API request.
query_job.result() # Wait for the query to complete.
# Get the destination table for the query results.
#
# All queries write to a destination table. If a destination table is not
# specified, the BigQuery populates it with a reference to a temporary
# anonymous table after the query completes.
destination = query_job.destination
# Get the schema (and other properties) for the destination table.
#
# A schema is useful for converting from BigQuery types to Python types.
destination = client.get_table(destination)
# Download rows.
#
# The client library automatically handles pagination.
print("The query data:")
rows = client.list_rows(destination, max_results=20)
for row in rows:
    print("name={}, count={}".format(row["name"], row["total_people"]))
You can also try to filter out big rows in your query:
WHERE LENGTH(some_field) < 123
or
WHERE LENGTH(CAST(some_field AS BYTES)) < 123

Related

Syncing Data from Google Sheet to Postgres RDS

I have the data from a Google Sheet in a DataFrame and I am using Pandas df.to_sql to import the data into Postgres RDS.
import gspread
import pandas as pd
import sqlalchemy as sa
from oauth2client.service_account import ServiceAccountCredentials as sac

def gsheet2df(spreadsheet_name, sheet_num):
    scope = ['https://spreadsheets.google.com/feeds', 'https://www.googleapis.com/auth/drive']
    credentials_path = 'billing-342104-8b351a7a2813.json'
    credentials = sac.from_json_keyfile_name(credentials_path, scope)
    client = gspread.authorize(credentials)
    sheet = client.open(spreadsheet_name).get_worksheet(sheet_num).get_all_records()
    df = pd.DataFrame.from_dict(sheet)
    print(df)
    return df

def write2db(ed):
    # USER, PASSWORD, HOST, PORT and DATABASE are defined elsewhere.
    # Note: credentials and host must be separated by '@', not '#'.
    connection_string = "postgresql+psycopg2://%s:%s@%s:%s/%s" % (USER, PASSWORD, HOST, str(PORT), DATABASE)
    engine = sa.create_engine(connection_string)
    connection = engine.connect()
    ed.to_sql('user_data', con=engine, if_exists='append', index='user_id')
But I have two use cases that are not handled yet and that I have not found much about:
When I import the data into the DB, the column names are the sheet column names. I want two extra columns to be added for every row: one with the time at which the row was last updated, and another flagging whether it was deleted. These values should be present for every row.
I have imported the data once, but now I want to sync it again and update the DB based on the changes in the sheet, without wiping out the whole table.
Any suggestions on how to achieve this?
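For the first point, a minimal sketch (the column names updated_at and is_deleted are hypothetical) would be to add the two columns to the DataFrame before calling to_sql:

df = gsheet2df('my_spreadsheet', 0)            # hypothetical spreadsheet name
df['updated_at'] = pd.Timestamp.now(tz='UTC')  # time of this sync
df['is_deleted'] = False                       # soft-delete flag, default False
write2db(df)

For the second point, an incremental sync usually means loading the sheet into a staging table and merging it into the target table on a key column, rather than appending blindly with to_sql.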

How to get/fetch certain columns from dynamoDB using python in Lambda?

I have a table called 'DATA' in DynamoDB with 20 to 25 columns, but I need to pull only 3 of them.
Required columns are status, ticket_id and country
table_name = 'DATA'
# dynamodb client
dynamodb_client = boto3.client('dynamodb')
I'm able to achieve this using scan, as shown below, but I want to do the same using the query method.
response = table.scan(AttributesToGet=['ticket_id','ticket_status'])
I tried the code below with the query method, but I'm getting an error.
response = table.query(ProjectionExpression=['ticket_id','ticket_status']),keyConditionExpression('opco_type').eq('cwc') or keyConditionExpression('opco_type').eq('cwp'))
Is there any way of getting only required columns from dynamo?
As already commented, you need to use ProjectionExpression:
import boto3

dynamodb = boto3.resource('dynamodb', region_name=region)
table = dynamodb.Table(table_name)
item = table.get_item(
    Key={'Title': 'Scarface', 'Year': 1983},
    ProjectionExpression='status, ticket_id, country')
Some things to note:
It is better to use the resource API instead of the client, since it avoids the special DynamoDB JSON syntax.
You need to pass the full (composite) key to get_item.
Selected columns should be in a comma-separated string
It is a good idea to always use expression attribute names:
item = table.get_item(
    Key={'Title': 'Scarface', 'Year': 1983},
    ProjectionExpression='#status, ticket_id, country',
    ExpressionAttributeNames={'#status': 'status'})
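Since the question specifically asks about the query method, a minimal sketch of the same projection with a key condition (assuming, as in the question's attempt, that opco_type is the table's partition key) would be:

from boto3.dynamodb.conditions import Key

response = table.query(
    KeyConditionExpression=Key('opco_type').eq('cwc'),
    ProjectionExpression='#status, ticket_id, country',
    ExpressionAttributeNames={'#status': 'status'})
items = response['Items']

Note that a single query can only target one partition key value, so covering both 'cwc' and 'cwp' requires two queries (or a scan with a filter).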

Set schema for only one column in BigQuery

I have a .csv file that I want to append to my BigQuery dataset/table, for which one column is in the format dd.mm.yyyy. As I would like to work with partitioned tables, I need that column to be of type DATE.
However, I am unsure how to set the schema for just one column. I tried the following:
from google.cloud import bigquery as bq

client = bq.Client()
dataset_ref = client.dataset(dataset_id)
table_ref = dataset_ref.table(table_id)

job_config = bq.LoadJobConfig()
job_config.write_disposition = bq.WriteDisposition.WRITE_APPEND
job_config.source_format = bq.SourceFormat.CSV
job_config.field_delimiter = delimiter
job_config.skip_leading_rows = 1
job_config.autodetect = True
job_config.schema_update_options = [
    bq.SchemaUpdateOption.ALLOW_FIELD_ADDITION,
]
job_config.schema = [
    bq.SchemaField('date_col', 'DATE')
]

job = client.load_table_from_file(
    source_file,
    table_ref,
    location="europe-west2",  # Must match the destination dataset location.
    job_config=job_config)  # API request
job.result()  # Waits for table load to complete.
but it gives the error:
google.api_core.exceptions.BadRequest: 400 Error while reading data,
error message: CSV table encountered too many errors, giving up. Rows:
1; errors: 1. Please look into the errors[] collection for more
details.
When I take out the .schema option it works fine, but then the column is imported as a STRING.
You cannot specify only one column in the schema, since all the column names and types are required when setting it. Moreover, a date in the format dd.mm.yyyy cannot be parsed as DATE when loading into BigQuery, so you have to load it as STRING and then parse it after it has been imported into BigQuery. Otherwise, you will have to change your data to the YYYY-MM-DD format.
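A minimal sketch of the "parse it after import" step, assuming the file is first loaded with the date column as a STRING into a staging table (all table names here are hypothetical), is to run a query with PARSE_DATE and write the result to the partitioned table:

from google.cloud import bigquery

client = bigquery.Client()
sql = """
SELECT
  * EXCEPT (date_col),
  PARSE_DATE('%d.%m.%Y', date_col) AS date_col
FROM `my_project.my_dataset.staging_table`  -- hypothetical staging table
"""
job_config = bigquery.QueryJobConfig(
    destination="my_project.my_dataset.final_table",  # hypothetical target table
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND)
client.query(sql, job_config=job_config).result()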

Auto-schema for time-partitioned tables in BigQuery

I am trying to append data to a time-partitioned table. We can create a time-partitioned table as follows:
# from google.cloud import bigquery
# client = bigquery.Client()
# dataset_ref = client.dataset('my_dataset')
table_ref = dataset_ref.table('my_partitioned_table')
schema = [
    bigquery.SchemaField('name', 'STRING'),
    bigquery.SchemaField('post_abbr', 'STRING'),
    bigquery.SchemaField('date', 'DATE')
]
table = bigquery.Table(table_ref, schema=schema)
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY,
    field='date',  # name of column to use for partitioning
    expiration_ms=7776000000)  # 90 days
table = client.create_table(table)
print('Created table {}, partitioned on column {}'.format(
    table.table_id, table.time_partitioning.field))
I was wondering, however, how to do this without pre-defining the schema, as I am looking for a generic way to append new data.
When I remove the schema in the example above, I get an error saying that a time-partitioned table requires a pre-defined schema. However, my files have changed over time, meaning that I cannot and do not want to redefine my schema (I will use Google DataPrep to clean it afterwards).
How can I solve this?
You can update the schema of a table when you append new data to it. The two supported schema updates are adding new fields and relaxing a required field to optional. Search for schemaUpdateOptions in the BigQuery documentation.
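A minimal sketch of such an append, letting BigQuery auto-detect the schema and allowing new or relaxed fields (file and table names are hypothetical), might look like:

from google.cloud import bigquery

client = bigquery.Client()
table_id = 'my_dataset.my_partitioned_table'  # hypothetical table

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,  # let BigQuery infer columns from the file
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    schema_update_options=[
        bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION,
        bigquery.SchemaUpdateOption.ALLOW_FIELD_RELAXATION,
    ])

with open('new_data.csv', 'rb') as source_file:  # hypothetical file
    job = client.load_table_from_file(source_file, table_id, job_config=job_config)
job.result()  # wait for the load to finish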

get the last modified date of tables using bigquery tables GET api

I am trying to get the list of tables and their last_modified_date using bigquery REST API.
In the BigQuery API explorer I am getting all the fields correctly, but when I use the API from Python code it returns 'None' for the modified date.
This is the code I have written for this in Python:
from google.cloud import bigquery

client = bigquery.Client(project='temp')
datasets = list(client.list_datasets())
for dataset in datasets:
    print dataset.dataset_id
for dataset in datasets:
    for table in dataset.list_tables():
        print table.table_id
        print table.created
        print table.modified
In this code I am getting the created date correctly, but the modified date is 'None' for all the tables.
Not quite sure which version of the API you are using, but I suspect the latest versions do not have the method dataset.list_tables().
Still, this is one way of getting the last modified field; see if this works for you (or gives you some idea of how to get this data):
from google.cloud import bigquery

client = bigquery.Client.from_service_account_json('/key.json')
dataset_list = list(client.list_datasets())
for dataset_item in dataset_list:
    dataset = client.get_dataset(dataset_item.reference)
    tables_list = list(client.list_tables(dataset))
    for table_item in tables_list:
        table = client.get_table(table_item.reference)
        print "Table {} last modified: {}".format(
            table.table_id, table.modified)
If you want to get the last modified time from only one table:
from google.cloud import bigquery

def get_last_bq_update(project, dataset, table_name):
    client = bigquery.Client.from_service_account_json('/key.json')
    table_id = f"{project}.{dataset}.{table_name}"
    table = client.get_table(table_id)
    print(table.modified)
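For example, with hypothetical names:

get_last_bq_update("my-project", "my_dataset", "my_table")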
