Set schema for only one column in BigQuery - python

I have a .csv file that I want to append to my BigQuery table, and one of its columns holds dates in the format dd.mm.yyyy. As I would like to work with partitioned tables, I need that column to be of type DATE.
However, I am unsure how to set the schema for just one column. I tried the following:
from google.cloud import bigquery as bq

dataset_ref = client.dataset(dataset_id)
table_ref = dataset_ref.table(table_id)

job_config = bq.LoadJobConfig()
job_config.write_disposition = bq.WriteDisposition.WRITE_APPEND
job_config.source_format = bq.SourceFormat.CSV
job_config.field_delimiter = delimiter
job_config.skip_leading_rows = 1
job_config.autodetect = True
job_config.schema_update_options = [
    bq.SchemaUpdateOption.ALLOW_FIELD_ADDITION,
]
job_config.schema = [
    bq.SchemaField('date_col', 'DATE')
]

job = client.load_table_from_file(
    source_file,
    table_ref,
    location="europe-west2",  # Must match the destination dataset location.
    job_config=job_config)  # API request
job.result()  # Waits for table load to complete.
but it gives the error:
google.api_core.exceptions.BadRequest: 400 Error while reading data,
error message: CSV table encountered too many errors, giving up. Rows:
1; errors: 1. Please look into the errors[] collection for more
details.
When I take out the job_config.schema assignment it works fine, but then the column is imported as a STRING.

You cannot specify only one column in the schema: when you set a schema, all column names and types are required. On the other hand, a date in the format dd.mm.yyyy cannot be parsed as DATE when loading into BigQuery, so you have to load it as STRING and then parse it after it has been imported. Otherwise, you will have to change your data format to YYYY-MM-DD before loading.
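For the first option (load as STRING, parse afterwards), a minimal sketch could look like the following; the project, dataset and table names are placeholders, and date_col is assumed to be the STRING column holding the dd.mm.yyyy values:

from google.cloud import bigquery

client = bigquery.Client()

# Rewrite the staging table (date_col loaded as STRING) into a table with a real DATE column.
sql = """
    SELECT
      * EXCEPT (date_col),
      PARSE_DATE('%d.%m.%Y', date_col) AS date_col
    FROM `my_project.my_dataset.my_staging_table`
"""

job_config = bigquery.QueryJobConfig(
    destination=bigquery.TableReference.from_string(
        "my_project.my_dataset.my_parsed_table"),
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)
client.query(sql, job_config=job_config).result()  # wait for the query job to finish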

Related

How to get/fetch certain columns from dynamoDB using python in Lambda?

I have a table called 'DATA' in dynamodb where I have 20 to 25 columns. But I need to pull only 3 columns from dynamodb.
Required columns are status, ticket_id and country
table_name = 'DATA'
# dynamodb client
dynamodb_client = boto3.client('dynamodb')
I'm able to achieve this using scan, as shown below, but I want to do the same using the query method.
response = table.scan(AttributesToGet=['ticket_id','ticket_status'])
I tried the code below with the query method, but I'm getting an error.
response = table.query(ProjectionExpression=['ticket_id','ticket_status']),keyConditionExpression('opco_type').eq('cwc') or keyConditionExpression('opco_type').eq('cwp'))
Is there any way of getting only required columns from dynamo?
As already commented, you need to use ProjectionExpression:
dynamodb = boto3.resource('dynamodb', region_name=region)
table = dynamodb.Table(table_name)
item = table.get_item(Key={'Title': 'Scarface', 'Year': 1983}, ProjectionExpression='status, ticket_id, country')
Some things to note:
It is better to use the resource interface instead of the client; this avoids the special DynamoDB JSON syntax.
You need to pass the full (composite) key to get_item
Selected columns should be in a comma-separated string
It is a good idea to always use expression attribute names (status, for example, is a DynamoDB reserved word):
item = table.get_item(
    Key={'Title': 'Scarface', 'Year': 1983},
    ProjectionExpression='#status, ticket_id, country',
    ExpressionAttributeNames={'#status': 'status'})
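If you specifically want the query method from the question, a sketch could look like this, assuming opco_type is the table's partition key (the key name and values are taken from the question and may need adjusting). Note that a single query cannot OR two partition-key values, so you would run one query per value ('cwc' and 'cwp') and merge the results:

import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('DATA')

response = table.query(
    KeyConditionExpression=Key('opco_type').eq('cwc'),
    ProjectionExpression='#status, ticket_id, country',
    ExpressionAttributeNames={'#status': 'status'})
items = response['Items']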

AWS Glue Data moving from S3 to Redshift

I have around 70 tables in one S3 bucket and I would like to move them to Redshift using Glue. I could move only a few tables; the rest have data type issues, as Redshift does not accept some of the data types. I resolved the issue with a piece of code that moves the tables one by one:
table1 = glueContext.create_dynamic_frame.from_catalog(
    database="db1_g", table_name="table1"
)
table1 = table1.resolveChoice(
    specs=[
        ("column1", "cast:char"),
        ("column2", "cast:varchar"),
        ("column3", "cast:varchar"),
    ]
)
table1 = glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=table1,
    catalog_connection="redshift",
    connection_options={"dbtable": "schema1.table1", "database": "db1"},
    redshift_tmp_dir=args["TempDir"],
    transformation_ctx="table1",
)
The same script is used for all the other tables that have data type issues. But, as I would like to automate this, I switched to a loop that iterates through all the tables and writes them to Redshift. I have two issues with this script:
Unable to move the tables to their respective schemas in Redshift.
Unable to add an if condition inside the loop for the tables that need a data type change.
client = boto3.client("glue", region_name="us-east-1")

databaseName = "db1_g"
Tables = client.get_tables(DatabaseName=databaseName)
tableList = Tables["TableList"]

for table in tableList:
    tableName = table["Name"]
    datasource0 = glueContext.create_dynamic_frame.from_catalog(
        database="db1_g", table_name=tableName, transformation_ctx="datasource0"
    )
    datasink4 = glueContext.write_dynamic_frame.from_jdbc_conf(
        frame=datasource0,
        catalog_connection="redshift",
        connection_options={
            "dbtable": tableName,
            "database": "schema1.db1",
        },
        redshift_tmp_dir=args["TempDir"],
        transformation_ctx="datasink4",
    )

job.commit()
Mentioning the Redshift schema name along with the tableName, like schema1.tableName, throws an error saying schema1 is not defined.
Can anybody help with changing the data types for all tables that require it, inside the looping script itself?
So the first problem is fixed rather easily. The schema belongs in the dbtable attribute, not in database, like this:
connection_options={
    "dbtable": f"schema1.{tableName}",
    "database": "db1",
}
Your second problem is that you want to call resolveChoice inside the for loop, correct? What kind of error occurs there? Why doesn't it work?
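In case it helps, here is a rough, untested sketch of how resolveChoice could be applied selectively inside the loop; the CAST_SPECS mapping is hypothetical and would have to be filled in with the tables and columns that actually fail:

# Hypothetical mapping from table name to the resolveChoice specs it needs.
CAST_SPECS = {
    "table1": [("column1", "cast:char"), ("column2", "cast:varchar")],
    # add the other problematic tables here
}

for table in tableList:
    tableName = table["Name"]
    frame = glueContext.create_dynamic_frame.from_catalog(
        database="db1_g", table_name=tableName, transformation_ctx=f"src_{tableName}"
    )
    # Only cast the tables that are known to have data type issues.
    if tableName in CAST_SPECS:
        frame = frame.resolveChoice(specs=CAST_SPECS[tableName])
    glueContext.write_dynamic_frame.from_jdbc_conf(
        frame=frame,
        catalog_connection="redshift",
        connection_options={"dbtable": f"schema1.{tableName}", "database": "db1"},
        redshift_tmp_dir=args["TempDir"],
        transformation_ctx=f"sink_{tableName}",
    )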

Skip forbidden rows from a BigQuery query, using Python

I need to download a relatively small table from BigQuery and store it (after some parsing) in a pandas dataframe.
Here is the relevant sample of my code:
import pandas as pd
from google.cloud import bigquery

client = bigquery.Client(project="project_id")
job_config = bigquery.QueryJobConfig(allow_large_results=True)
query_job = client.query("my sql string", job_config=job_config)
result = query_job.result()
rows = [dict(row) for row in result]
pdf = pd.DataFrame.from_dict(rows)
My problem:
After a few thousand rows are parsed, one of them is too big and I get an exception: google.api_core.exceptions.Forbidden.
So, after a few iterations, I tried to transform my loop to something that looks like:
rows = list()
for _ in range(result.total_rows):
    try:
        rows.append(dict(next(result)))
    except google.api_core.exceptions.Forbidden:
        pass
But it doesn't work, since result is a bigquery.table.RowIterator and, despite its name, it's not an iterator... it's an iterable.
So... what do I do now? Is there a way to either:
ask for the next row in a try/except scope?
tell bigquery to skip bad rows?
Did you try paging through query results?
from google.cloud import bigquery

# Construct a BigQuery client object.
client = bigquery.Client()

query = """
    SELECT name, SUM(number) as total_people
    FROM `bigquery-public-data.usa_names.usa_1910_2013`
    GROUP BY name
    ORDER BY total_people DESC
"""
query_job = client.query(query)  # Make an API request.
query_job.result()  # Wait for the query to complete.

# Get the destination table for the query results.
#
# All queries write to a destination table. If a destination table is not
# specified, BigQuery populates it with a reference to a temporary
# anonymous table after the query completes.
destination = query_job.destination

# Get the schema (and other properties) for the destination table.
#
# A schema is useful for converting from BigQuery types to Python types.
destination = client.get_table(destination)

# Download rows.
#
# The client library automatically handles pagination.
print("The query data:")
rows = client.list_rows(destination, max_results=20)
for row in rows:
    print("name={}, count={}".format(row["name"], row["total_people"]))
Also you can try to filter out big rows in your query:
WHERE LENGTH(some_field) < 123
or
WHERE LENGTH(CAST(some_field AS BYTES)) < 123
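And if you want to try the per-row try/except idea from the question, wrapping the RowIterator in iter() gives you a plain iterator you can call next() on. This is an untested sketch; whether iteration can cleanly resume after a Forbidden error depends on where the error is raised (it may be raised while fetching a whole page rather than a single row):

import pandas as pd
from google.api_core.exceptions import Forbidden
from google.cloud import bigquery

client = bigquery.Client(project="project_id")
result = client.query("my sql string").result()

it = iter(result)  # RowIterator is iterable; iter() returns a real iterator
rows = []
while True:
    try:
        rows.append(dict(next(it)))
    except StopIteration:
        break
    except Forbidden:
        continue  # skip whatever triggered the error and keep going

pdf = pd.DataFrame(rows)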

problem with list_rows with max_results value set and to_dataframe in Kaggle's "Intro to SQL" course

I could use some help. In part 1, "Getting Started with SQL and BigQuery", I'm running into the following issue. I've gotten down to In[7]:
# Preview the first five lines of the "full" table
client.list_rows(table, max_results=5).to_dataframe()
and I get the error:
getting_started_with_bigquery.py:41: UserWarning: Cannot use bqstorage_client if max_results is set, reverting to fetching data with the tabledata.list endpoint.
client.list_rows(table, max_results=5).to_dataframe()
I'm writing my code in Notepad++ and running it from the command prompt on Windows. I've gotten everything else working up until this point, but I'm having trouble finding a solution to this problem. A Google search leads me to the source code for google.cloud.bigquery.table, which suggests that this error should come up if pandas is not installed, so I installed it and added import pandas to my code, but I'm still getting the same error.
Here is my full code:
from google.cloud import bigquery
import os
import pandas

# need to set credential path
credential_path = r"C:\Users\crlas\learningPython\google_application_credentials.json"
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = credential_path

# create a "Client" object
client = bigquery.Client()

# construct a reference to the "hacker_news" dataset
dataset_ref = client.dataset("hacker_news", project="bigquery-public-data")
# API request - fetch the dataset
dataset = client.get_dataset(dataset_ref)

# list all tables in the dataset
tables = list(client.list_tables(dataset))
# print all table names
for table in tables:
    print(table.table_id)
print()

# construct a reference to the "full" table
table_ref = dataset_ref.table("full")
# API request - fetch the table
table = client.get_table(table_ref)

# print info on all the columns in the "full" table
print(table.schema)
# print("table schema should have printed above")
print()

# preview first 5 lines of the table
client.list_rows(table, max_results=5).to_dataframe()
As the warning message says - UserWarning: Cannot use bqstorage_client if max_results is set, reverting to fetching data with the tabledata.list endpoint - this is only a warning: the call still works and simply falls back to the tabledata.list API to retrieve the data.
You just need to point the output to a dataframe object and print it, like below:
df = client.list_rows(table, max_results=5).to_dataframe()
print(df)
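If the warning itself bothers you, one option (a small sketch, not required for the course) is to silence that specific UserWarning with the standard warnings module:

import warnings

# Suppress only the bqstorage_client/max_results warning.
warnings.filterwarnings(
    "ignore",
    message="Cannot use bqstorage_client if max_results is set",
)

df = client.list_rows(table, max_results=5).to_dataframe()
print(df)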

Auto-schema for time-partitioned tables in BigQuery

I am trying to append data to a time-partitioned table. We can create a time-partitioned table as follows:
# from google.cloud import bigquery
# client = bigquery.Client()
# dataset_ref = client.dataset('my_dataset')

table_ref = dataset_ref.table('my_partitioned_table')
schema = [
    bigquery.SchemaField('name', 'STRING'),
    bigquery.SchemaField('post_abbr', 'STRING'),
    bigquery.SchemaField('date', 'DATE')
]
table = bigquery.Table(table_ref, schema=schema)
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY,
    field='date',  # name of column to use for partitioning
    expiration_ms=7776000000)  # 90 days
table = client.create_table(table)

print('Created table {}, partitioned on column {}'.format(
    table.table_id, table.time_partitioning.field))
I was wondering, however, how to do this without pre-defining the schema, as I am looking for a generic way to append new data.
When I remove the schema in the example above, I get an error saying that a time-partitioned table requires a pre-defined schema. However, my files have changed over time, which means I cannot and do not want to redefine my schema (I will use Google DataPrep to clean it afterwards).
How can I solve this?
You can update the schema of the table when you append new data to it. The two supported schema updates are adding new fields and relaxing a required field to optional. Search for schemaUpdateOptions in the load job documentation; a sketch is below.
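For illustration, a minimal sketch of appending to an existing time-partitioned table while letting the load job add new columns and relax required ones (the dataset, table and file names are placeholders):

from google.cloud import bigquery

client = bigquery.Client()
table_ref = client.dataset('my_dataset').table('my_partitioned_table')

job_config = bigquery.LoadJobConfig()
job_config.source_format = bigquery.SourceFormat.CSV
job_config.skip_leading_rows = 1
job_config.autodetect = True
job_config.write_disposition = bigquery.WriteDisposition.WRITE_APPEND
# Allow the append to evolve the existing schema.
job_config.schema_update_options = [
    bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION,
    bigquery.SchemaUpdateOption.ALLOW_FIELD_RELAXATION,
]

with open('new_data.csv', 'rb') as source_file:
    job = client.load_table_from_file(source_file, table_ref, job_config=job_config)
job.result()  # wait for the load to finish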
