In BigQuery, I have a table with 608 GB of data, 50 million rows, and 2,651 columns. I'm trying to load it into JupyterLab as a pandas DataFrame before doing any modeling. I'm saving the query results into a pandas DataFrame, using a destination table with %%bigquery. However, because of the large size, I'm getting an error. I followed the documentation and a couple of Stack Overflow discussions that suggested using LIMIT and setting allow_large_results = True, but I am unable to determine how to apply them to my specific problem.
Please advise.
Thanks.
To use configuration.query.allowLargeResults, set it to true in your job configuration and also specify a destination table object.
If you are using Python, the example below sets allow_large_results to True together with a destination table.
from google.cloud import bigquery
# Construct a BigQuery client object.
client = bigquery.Client()
# TODO(developer): Set table_id to the ID of the destination table.
# table_id = "your-project.your_dataset.your_table_name"
# Set the destination table and use_legacy_sql to True to use
# legacy SQL syntax.
job_config = bigquery.QueryJobConfig(
    allow_large_results=True, destination=table_id, use_legacy_sql=True
)

sql = """
    SELECT corpus
    FROM [bigquery-public-data:samples.shakespeare]
    GROUP BY corpus;
"""
# Start the query, passing in the extra configuration.
query_job = client.query(sql, job_config=job_config) # Make an API request.
query_job.result() # Wait for the job to complete.
print("Query results loaded to the table {}".format(table_id))
If you are querying via the REST API, the equivalent job configuration looks like this (destinationTable is a table reference object):
"configuration": {
  "query": {
    "allowLargeResults": true,
    "query": "SELECT uid FROM [project:dataset.table]",
    "destinationTable": {
      "projectId": "project",
      "datasetId": "dataset",
      "tableId": "table"
    }
  }
}
Using allow_large_results has the following limitations:
You must specify a destination table.
You cannot specify a top-level ORDER BY, TOP, or LIMIT clause.
Window functions can return large query results only if used in conjunction with a PARTITION BY clause.
See the official documentation for more details.
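Once the results are materialized in the destination table, you can pull a manageable slice of them back into pandas rather than the full 608 GB. This is only a minimal sketch, assuming the google-cloud-bigquery client library and a placeholder table ID:
from google.cloud import bigquery

client = bigquery.Client()

# Placeholder: the destination table written by the large-results query.
table_id = "your-project.your_dataset.your_table_name"

# Page through the stored results and load only the first N rows into pandas.
rows = client.list_rows(table_id, max_results=100000)
df = rows.to_dataframe()

print(df.shape)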
Related
Using Airflow, I am trying to get the data from one table and insert it into another in BigQuery. I have 5 origin tables and 5 destination tables. My SQL query and Python logic work for 4 of the tables, successfully getting the data and inserting it into the respective destination tables, but they don't work for 1 table.
query = '''SELECT * EXCEPT(eventdate) FROM `gcp_project.gcp_dataset.gcp_table_1`
WHERE id = "1234"
AND eventdate = "2023-01-18"
'''
# Delete the previous destination tables if existed
bigquery_client.delete_table("gcp_project.gcp_dataset.dest_gcp_table_1", not_found_ok=True)
job_config = bigquery.QueryJobConfig()
table_ref = bigquery_client.dataset(gcp_dataset).table(dest_gcp_table_1)
job_config.destination = table_ref
job_config.write_disposition = bigquery.WriteDisposition.WRITE_TRUNCATE
# Start the query, passing in the extra configuration.
query_job = bigquery_client.query(query=query,
location='US',
job_config=job_config
)
# Check if the table is successfully written
while not query_job.done():
    time.sleep(1)
logging.info("Data is written into a destination table with {} number of rows for id {}."
             .format(query_job.result().total_rows, id))
I have even tried a SQL query with CREATE OR REPLACE, but the result was the same: table_1 comes out empty. I have also tried BigQueryInsertJobOperator, but table_1 still comes out empty.
Note: table_1's data is around 270 MB with 1,463,306 rows; it is also the largest of all the tables being copied into another table.
When I execute the same logic from my local machine, it works fine for table_1 as well; I can see the data in GCP BigQuery.
I am not sure what is happening behind the scenes. Does anyone have any idea why this is happening or what could cause it?
Found the root cause for this: the previous query, which is responsible for populating the origin table, was still running in the BigQuery backend. Because of that, the query above did not get any data.
Solution: introduce query_job.result() on the populating job. This waits for that job to complete before the next query is executed.
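A minimal sketch of that ordering, assuming both queries are issued from the same task (populate_query is a placeholder name, not from the original post):
# Run the query that populates the origin table first.
populate_job = bigquery_client.query(populate_query, location='US')
populate_job.result()  # Block until the origin table is fully written.

# Only then run the query that copies data into the destination table.
copy_job = bigquery_client.query(query, location='US', job_config=job_config)
copy_job.result()  # Block until the destination table is written.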
I appreciate this may be quite trivial, but I am struggling to find an elegant solution.
Assume I have access to modify the job configuration, through Python in this case.
I am invoking a load job through the BigQuery Python API, using schema autodetect to parse a CSV file into a BigQuery table.
Is it possible to ignore certain columns as part of the load job?
For Example
I am creating a LoadJob sourced from the following CSV file (which I have formatted to make it easier on the eyes).
First_Name, Age, Gender
John, 26, Male
Is it possible to invoke a LoadJob using the Python BigQuery API which will produce the following table -
----------------
| Age | Gender |
----------------
| 26 | Male |
My current solution, shown below, creates an external table and uses SQL to filter it and save the result as a new table. Surely there is an easier way to do this via the bigquery.job.LoadJobConfig class.
# Configure the external table
uri = 'gs://my-bucket/myFile.csv'
table = bigquery.Table('my-project.my-dataset.my-temporary-table')
external_config = bigquery.ExternalConfig("CSV")
external_config.source_uris = [
    uri
]
external_config.options.skip_leading_rows = 1
external_config.autodetect = True
table.external_data_configuration = external_config

# Create the external table
external_table = bq_client.create_table(table)

# Query the external table, writing only the wanted columns to a new table
filtered_table = bigquery.Table('my-project.my-dataset.my-filtered-table')
filtered_table_config = bigquery.QueryJobConfig(destination=filtered_table)
sql = f"""
    SELECT Age, Gender
    FROM {external_table.project}.{external_table.dataset_id}.{external_table.table_id}
"""

# Start the query, passing in the extra configuration.
query_job = bq_client.query(sql, job_config=filtered_table_config)  # Make an API request.
query_job.result()  # Wait for the query to complete.
bq_client.delete_table(table=external_table, timeout=30)  # Remove the external table
No, you can't do that. The best approach is to load the whole file into a table and then perform a SELECT on only the desired columns.
OK, you will use some wasted storage in BigQuery, but you won't pay for the processing of the unwanted columns (because you don't select them).
If there is too much unused data, the next step is to query the useful columns, save them into a definitive table, and delete the temporary table that holds all the data.
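For illustration, a minimal sketch of that load-then-select pattern, assuming schema autodetect and placeholder table IDs (temp_table_id and final_table_id are not from the original post):
from google.cloud import bigquery

bq_client = bigquery.Client()

# Placeholder IDs for the temporary and final tables.
temp_table_id = "my-project.my_dataset.my_temporary_table"
final_table_id = "my-project.my_dataset.my_filtered_table"

# 1. Load the whole CSV into a temporary table with schema autodetect.
load_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,
)
load_job = bq_client.load_table_from_uri(
    "gs://my-bucket/myFile.csv", temp_table_id, job_config=load_config
)
load_job.result()  # Wait for the load to finish.

# 2. Select only the wanted columns into the definitive table.
query_config = bigquery.QueryJobConfig(destination=final_table_id)
bq_client.query(
    f"SELECT Age, Gender FROM `{temp_table_id}`", job_config=query_config
).result()

# 3. Drop the temporary table that holds all the columns.
bq_client.delete_table(temp_table_id, not_found_ok=True)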
I'm deleting BigQuery rows from a table using the pandas-gbq library, which works fine.
However, since this is a "read" action, by default the whole table content is fetched, and I do not want that to happen, since it is unnecessary.
My current code is below. Any ideas on how to perform the delete without fetching the table as a DataFrame?
Thanks in advance.
Delete gbq table rows - Today and yesterday
sql = """
    DELETE FROM `bla.bla.bla`
    WHERE Day = '{today}' OR Day = '{yesterday}'
"""
sql = sql.format(today=curr_date, yesterday=prev_date)
pandas_gbq.read_gbq(sql, project_id=project_id, credentials=credentials)
Why not use google-cloud-bigquery to invoke the query, which provides better access to the BQ API surface?
pandas_gbq by its nature provides only a subset to enable integration with the pandas ecosystem. See this document for more information about the differences and migrating between the two.
Here's a quick equivalent using google-cloud-bigquery:
def do_the_thing():
    from google.cloud import bigquery
    bqclient = bigquery.Client()
    sql = """
        DELETE FROM `bla.bla.bla`
        WHERE Day = '{today}' OR Day = '{yesterday}'
    """
    query = bqclient.query(sql)
    print("started query as {}".format(query.job_id))
    # Invoke result() to wait until completion.
    # DDL/DML queries don't return rows, so we don't need a row iterator.
    query.result()
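As a side note, one possible refinement is to pass the dates as query parameters instead of string formatting. This is only a sketch, assuming Day is a DATE column (the column type and the date values below are assumptions, not from the original post):
import datetime
from google.cloud import bigquery

bqclient = bigquery.Client()

# Assumed date values; substitute your own.
curr_date = datetime.date.today()
prev_date = curr_date - datetime.timedelta(days=1)

sql = """
    DELETE FROM `bla.bla.bla`
    WHERE Day = @today OR Day = @yesterday
"""
job_config = bigquery.QueryJobConfig(
    query_parameters=[
        bigquery.ScalarQueryParameter("today", "DATE", curr_date),
        bigquery.ScalarQueryParameter("yesterday", "DATE", prev_date),
    ]
)
bqclient.query(sql, job_config=job_config).result()  # Wait for the DML job to finish.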
I'm creating a quick-and-dirty ETL using Python from one database server (currently DB2) to another (MSSQL). I'm just landing the data, so no transformation is occurring. The code I've written works, but it must retrieve the entire dataset first, and then insert the entire dataset to the destination.
I'd like to create a solution that would allow me to specify 'x' number of rows to pull from the source, and batch them to the destination.
I'm sure there's an elegant solution out there, but I'm not familiar enough with Python. I'm just looking for recommendations on an implementation, methods to use, or techniques.
I'm using SQLAlchemy and pandas to accomplish the task. My source and destination tables are identical (as much as possible, since datatypes differ between SQL implementations). I'm populating a dataframe, then bulk inserting the data using MetaData and automap_base.
Bulk insert function
from sqlalchemy import MetaData
from sqlalchemy.ext.automap import automap_base
from sqlalchemy.orm import sessionmaker

def bulkInsert(engine, df, tableName, schemaName='dbo'):
    # Reflect the destination table and map it to a class.
    metadata = MetaData()
    metadata.reflect(engine, only=[tableName], schema=schemaName)
    Base = automap_base(metadata=metadata)
    Base.prepare()
    tableToInsert = Base.classes[tableName]

    # Bulk insert the dataframe records in a single session.
    conn = engine.connect()
    Session = sessionmaker(bind=conn)
    session = Session()
    session.bulk_insert_mappings(tableToInsert, df.to_dict(orient="records"), render_nulls=True)
    session.commit()
    session.close()
    conn.close()
Grab the source data
db2 = db2Connect(db2Server)
df = pd.read_sql(query, db2, coerce_float=False)
db2.close()
Set up destination
engine = mssqlSAEngine(server, database)
Start bulk insert, replace NaN with NULL
bulkInsert(engine, df.where(pd.notnull(df), None), tableName)
I've had no trouble with successfully inserting data. However, when I approach the million row mark, my system runs out of memory, and data starts paging. Naturally, performance degrades substantially.
We do have other tools in house (SSIS for example), but I'm looking for a dynamic method. In SSIS, I can either write a C# script task to basically accomplish what I'm doing here in Python, or create custom DFT's for each table. With this method, I just need to pass the source and destination.
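For what it's worth, here is a minimal sketch of the kind of batching described above, using pandas' chunksize parameter to pull 'x' rows at a time and reusing the bulkInsert function from the code above; the chunk size is an illustrative assumption:
# Pull the source data in batches instead of one large dataframe.
chunk_size = 50000  # illustrative batch size

db2 = db2Connect(db2Server)
engine = mssqlSAEngine(server, database)

# read_sql with chunksize returns an iterator of dataframes.
for chunk in pd.read_sql(query, db2, coerce_float=False, chunksize=chunk_size):
    # Insert each batch, replacing NaN with NULL as before.
    bulkInsert(engine, chunk.where(pd.notnull(chunk), None), tableName)

db2.close()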
I am currently using BigQuery's streaming option to load data into tables. However, tables that have a date partition set up do not show any partitions... I am aware that this is an effect of the streaming.
The Python code I use:
from google.cloud import bigquery

def stream_data(dataset_name, table_name, data):
    bigquery_client = bigquery.Client()
    dataset = bigquery_client.dataset(dataset_name)
    table = dataset.table(table_name)

    # Reload the table to get the schema.
    table.reload()

    rows = data
    errors = table.insert_data(rows)
    if not errors:
        print('Loaded 1 row into {}:{}'.format(dataset_name, table_name))
    else:
        print('Errors:')
        print(errors)
Will the date partitions eventually show up, and if not, how can I create an insert job to make this happen?
Not sure what you mean by "partitions not being shown", but when you create a partitioned table you will only see a single table.
The only difference is that you can query this table for date partitions, like so:
SELECT
  *
FROM
  mydataset.partitioned_table
WHERE
  _PARTITIONTIME BETWEEN TIMESTAMP('2016-12-25')
  AND TIMESTAMP('2016-12-31');
As you can see in this example, partitioned tables have the pseudo column _PARTITIONTIME, and that's what you use to select the partitions you are interested in.
For more info, see the docs explaining a bit more about querying data in partitioned tables.
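If you want to check which daily partitions your streamed rows have landed in, one option is to group on the pseudo column. This is just a sketch against the same mydataset.partitioned_table example; note that rows still sitting in the streaming buffer typically appear with a NULL _PARTITIONTIME until they are flushed into a partition.
SELECT
  _PARTITIONTIME AS pt,
  COUNT(*) AS row_count
FROM
  mydataset.partitioned_table
GROUP BY
  pt
ORDER BY
  pt;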