Get the last modified date of tables using the BigQuery tables GET API - Python

I am trying to get the list of tables and their last_modified_date using the BigQuery REST API.
In the BigQuery API explorer I get all the fields correctly, but when I call the API from Python the modified date comes back as 'None'.
This is the Python code I wrote for it:
from google.cloud import bigquery

client = bigquery.Client(project='temp')
datasets = list(client.list_datasets())
for dataset in datasets:
    print dataset.dataset_id
for dataset in datasets:
    for table in dataset.list_tables():
        print table.table_id
        print table.created
        print table.modified
With this code I get the created date correctly, but the modified date is 'None' for all the tables.

Not quite sure which version of the client library you are using, but I suspect the latest versions no longer have the method dataset.list_tables().
Still, this is one way of getting the last modified field; see if it works for you (or gives you some idea of how to get this data):
from google.cloud import bigquery

client = bigquery.Client.from_service_account_json('/key.json')
dataset_list = list(client.list_datasets())
for dataset_item in dataset_list:
    dataset = client.get_dataset(dataset_item.reference)
    tables_list = list(client.list_tables(dataset))
    for table_item in tables_list:
        table = client.get_table(table_item.reference)
        print "Table {} last modified: {}".format(
            table.table_id, table.modified)

If you want to get the last modified time from only one table:
from google.cloud import bigquery

def get_last_bq_update(project, dataset, table_name):
    client = bigquery.Client.from_service_account_json('/key.json')
    table_id = f"{project}.{dataset}.{table_name}"
    table = client.get_table(table_id)
    print(table.modified)
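A quick usage example (the project, dataset and table names below are placeholders, not taken from the original question):

get_last_bq_update('my-project', 'my_dataset', 'my_table')

If you'd rather reuse the value elsewhere, return table.modified instead of printing it.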

Related

Skip forbidden rows from a BigQuery query, using Python

I need to download a relatively small table from BigQuery and store it (after some parsing) in a Pandas DataFrame.
Here is the relevant sample of my code:
from google.cloud import bigquery
import pandas as pd

client = bigquery.Client(project="project_id")
job_config = bigquery.QueryJobConfig(allow_large_results=True)
query_job = client.query("my sql string", job_config=job_config)
result = query_job.result()
rows = [dict(row) for row in result]
pdf = pd.DataFrame.from_dict(rows)
My problem:
After a few thousand rows are parsed, one of them is too big and I get an exception: google.api_core.exceptions.Forbidden.
So, after a few iterations, I tried to transform my loop into something that looks like this:
rows = list()
for _ in range(result.total_rows):
    try:
        rows.append(dict(next(result)))
    except google.api_core.exceptions.Forbidden:
        pass
BUT it doesn't work, since result is a bigquery.table.RowIterator and, despite its name, it's not an iterator... it's an iterable.
So... what do I do now? Is there a way to either:
- ask for the next row inside a try/except scope?
- tell BigQuery to skip bad rows?
Did you try paging through query results?
from google.cloud import bigquery

# Construct a BigQuery client object.
client = bigquery.Client()

query = """
    SELECT name, SUM(number) as total_people
    FROM `bigquery-public-data.usa_names.usa_1910_2013`
    GROUP BY name
    ORDER BY total_people DESC
"""
query_job = client.query(query)  # Make an API request.
query_job.result()  # Wait for the query to complete.

# Get the destination table for the query results.
#
# All queries write to a destination table. If a destination table is not
# specified, BigQuery populates it with a reference to a temporary
# anonymous table after the query completes.
destination = query_job.destination

# Get the schema (and other properties) for the destination table.
#
# A schema is useful for converting from BigQuery types to Python types.
destination = client.get_table(destination)

# Download rows.
#
# The client library automatically handles pagination.
print("The query data:")
rows = client.list_rows(destination, max_results=20)
for row in rows:
    print("name={}, count={}".format(row["name"], row["total_people"]))
You can also try to filter out big rows in your query:
WHERE LENGTH(some_field) < 123
or
WHERE LENGTH(CAST(some_field AS BYTES)) < 123
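If you still want the per-row try/except pattern from the question, one workaround (a sketch only; the project id and SQL string are placeholders, and whether skipping actually recovers depends on whether the Forbidden error is raised for a single row or for a whole page fetch) is to get a real iterator from the RowIterator with iter():

from google.api_core import exceptions
from google.cloud import bigquery

client = bigquery.Client(project="project_id")  # placeholder project id
result = client.query("my sql string").result()  # placeholder SQL

rows = []
it = iter(result)  # RowIterator is iterable, so iter() gives a plain iterator
while True:
    try:
        rows.append(dict(next(it)))
    except StopIteration:
        break  # no more rows
    except exceptions.Forbidden:
        continue  # skip the offending row and keep going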

How to get the latest database collection from MongoDB with MongoEngine

I am very new to MongoDB. I create a database within a loop: every 2 hours I get data from some sources, create a data collection with MongoEngine, and name it after its creation time (for example 05_01_2021_17_00_30).
Now, in another Python script, I want to get the latest database. How can I access the latest database collection without knowing its name?
I saw some guidelines on Stack Overflow, but the code is old and no longer works. Thanks, guys.
I came up with this answer:
In mongo_setup.py: when I create a database, it is named after the creation time and that name is saved in a text file.
import mongoengine
import datetime

def global_init():
    nownow = datetime.datetime.now()
    Update_file_name = str(nownow.strftime("%d_%m_%Y_%H_%M_%S"))
    # For handshaking between Django and the last updated database, export the name
    # of the latest database to a text file; from there, Django will understand
    # which database is the latest.
    Updated_txt = open('.\\Latest database to read for Django.txt', '+w')
    Updated_txt.write(Update_file_name)
    Updated_txt.close()
    mongoengine.register_connection(alias='core', name=Update_file_name)
In Django's views.py we read the text file to get the latest database's name:
from pymongo import MongoClient

database_name_text_file = 'directory of the text file...'
db_name_file = open(database_name_text_file, 'r')
db_name = db_name_file.read()

# MongoDB database
myclient = MongoClient(port=27017)
mydatabase = myclient[db_name]
classagg = mydatabase['aggregation__class']
database_text = classagg.find()
for i in database_text:
    ....
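An alternative that avoids the text file, assuming every database follows the %d_%m_%Y_%H_%M_%S naming scheme used above, is to list the database names with pymongo and pick the most recent one by parsing each name (a sketch, not the original author's approach):

import datetime
from pymongo import MongoClient

myclient = MongoClient(port=27017)

def parse_name(name):
    # Returns a datetime for names like 05_01_2021_17_00_30, else None
    try:
        return datetime.datetime.strptime(name, "%d_%m_%Y_%H_%M_%S")
    except ValueError:
        return None  # skips built-in databases such as admin/local/config

candidates = [n for n in myclient.list_database_names() if parse_name(n)]
latest_db_name = max(candidates, key=parse_name)
mydatabase = myclient[latest_db_name]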

Problem with list_rows with max_results value set and to_dataframe in Kaggle's "Intro to SQL" course

I could use some help. In part 1, "Getting Started with SQL and BigQuery", I'm running into the following issue. I've gotten down to In[7]:
# Preview the first five lines of the "full" table
client.list_rows(table, max_results=5).to_dataframe()
and I get the error:
getting_started_with_bigquery.py:41: UserWarning: Cannot use bqstorage_client if max_results is set, reverting to fetching data with the tabledata.list endpoint.
client.list_rows(table, max_results=5).to_dataframe()
I'm writing my code in Notepad++ and running it from the command prompt on Windows. I've gotten everything else working up to this point, but I'm having trouble finding a solution to this problem. A Google search leads me to the source code for google.cloud.bigquery.table, which suggests that error should come up if pandas is not installed, so I installed it and added import pandas to my code, but I'm still getting the same error.
Here is my full code:
from google.cloud import bigquery
import os
import pandas

# need to set credential path
credential_path = r"C:\Users\crlas\learningPython\google_application_credentials.json"
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = credential_path

# create a "Client" object
client = bigquery.Client()

# construct a reference to the "hacker_news" dataset
dataset_ref = client.dataset("hacker_news", project="bigquery-public-data")

# API request - fetch the dataset
dataset = client.get_dataset(dataset_ref)

# list all tables in the dataset
tables = list(client.list_tables(dataset))

# print all table names
for table in tables:
    print(table.table_id)
print()

# construct a reference to the "full" table
table_ref = dataset_ref.table("full")

# API request - fetch the table
table = client.get_table(table_ref)

# print info on all the columns in the "full" table
print(table.schema)
# print("table schema should have printed above")
print()

# preview first 5 lines of the table
client.list_rows(table, max_results=5).to_dataframe()
As the warning message says: UserWarning: Cannot use bqstorage_client if max_results is set, reverting to fetching data with the tabledata.list endpoint.
This is only a warning: the call still works and falls back to the tabledata.list API to retrieve the data.
You just need to assign the output to a DataFrame object and print it, like below:
df = client.list_rows(table, max_results=5).to_dataframe()
print(df)
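If the warning itself bothers you, it can be silenced with the standard warnings module before the call (a sketch; the message pattern only needs to match the start of the warning text):

import warnings

# Hide the "Cannot use bqstorage_client if max_results is set" UserWarning
warnings.filterwarnings("ignore", message="Cannot use bqstorage_client")

df = client.list_rows(table, max_results=5).to_dataframe()
print(df)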

Reading a MySQL query with Python where output is empty

I'm trying to connect MySQL with Python in order to automate some reports. For now, I'm just testing the connection. It seems to be working, but here comes the problem: the output from my Python code is different from the one I get in MySQL.
Here I attach the query used and the output that I get in MySQL.
The testing query for the Python connection:
SELECT accountID
FROM Account
WHERE accountID in ('340','339','343');
The output from MySQL (using DBeaver); for this test, the chosen column contains integers:
  accountID
1       339
2       340
3       343
Here I attach the actual output from my Python code:
today:
20200811
Will return true if the connection works:
True
Empty DataFrame
Columns: [accountID]
Index: []
To help you understand the problem, please find my Python code below:
import pandas as pd
import json
import pymysql
import paramiko
from datetime import date
from time import time  # presumably intended as a timer (datetime.time would not work here)

tiempo_inicial = time()
today = date.today()
today = today.strftime("%Y%m%d")
print('today:')
print(today)

# from paramiko import SSHClient
from sshtunnel import SSHTunnelForwarder

**(part that contains all the connection information, due to data protection this part can't be shared)**

print('will return true if connection works:')
print(conn.open)

query = '''SELECT accountId
           FROM Account
           WHERE accountID in ('340','339','343');'''
data = pd.read_sql_query(query, conn)
print(data)
conn.close()
From my point of view this output doesn't make sense, as the connection is working and the query was previously tested in MySQL with a positive result. I tried with other columns that contain names or dates and the result doesn't change.
Any idea why I'm getting this "Empty DataFrame" output?
Thanks

BigQuery insert dates into 'DATE' type field using Python Google Cloud library

I'm using Python 2.7 and the Google Cloud Client Library for Python (v0.27.0) to insert data into a BigQuery table (using table.insert_data()).
One of the fields in my table has type 'DATE'.
In my Python script I've formatted the date-data as 'YYYY-MM-DD', but unfortunately the Google Cloud library returns an 'Invalid date:' error for that field.
I've tried formatting the date field in many ways (e.g. 'YYYYMMDD', a timestamp, etc.), but no luck so far...
Unfortunately the API docs (https://googlecloudplatform.github.io/google-cloud-python/latest/) don't mention anything about the required date format/type/object in Python.
This is my code:
from google.cloud import bigquery
import pandas as pd
import json
from pprint import pprint
from collections import OrderedDict

# Using a pandas dataframe 'df' as input

# Converting date field to YYYY-MM-DD format
df['DATE_VALUE_LOCAL'] = df['DATE_VALUE_LOCAL'].apply(lambda x: x.strftime('%Y-%m-%d'))

# Converting pandas dataframe to json
json_data = df.to_json(orient='records', date_format='iso')

# Instantiates a client
bigquery_client = bigquery.Client(project="xxx")

# The name for the new dataset
dataset_name = 'dataset_name'
table_name = 'table_name'

def stream_data(dataset_name, table_name, json_data):
    dataset = bigquery_client.dataset(dataset_name)
    table = dataset.table(table_name)
    data = json.loads(json_data, object_pairs_hook=OrderedDict)
    # Reload the table to get the schema.
    table.reload()
    errors = table.insert_data(data)
    if not errors:
        print('Loaded 1 row into {}:{}'.format(dataset_name, table_name))
    else:
        print('Errors:')
        pprint(errors)

stream_data(dataset_name, table_name, json_data)
What is the required Python date format/type/object to insert my dates into a BigQuery DATE field?
I just simulated your code here and everything worked fine. Here's what I've simulated:
import pandas as pd
import json
import os
from collections import OrderedDict
from google.cloud.bigquery import Client

d = {'ed': ['3', '5'],
     'date': ['2017-10-11', '2017-11-12']}
df = pd.DataFrame(d)

json_data = df.to_json(orient='records', date_format='iso')
json_data = json.loads(json_data, object_pairs_hook=OrderedDict)

os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = '/key.json'
bc = Client()
ds = bc.dataset('dataset name')
table = ds.table('table I just created')
table = bc.get_table(table)
bc.create_rows(table, json_data)
It's using version 0.28.0, but these are still the same methods as in previous versions.
You probably have a mistake in some step that converts the date to a format BigQuery cannot identify. Try using this script as a reference to see where the mistake might be happening in your code.
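For what it's worth, in newer versions of google-cloud-bigquery (1.x and later) the streaming call is client.insert_rows_json, and a plain 'YYYY-MM-DD' string is accepted for DATE columns. A minimal sketch, reusing the column names from the simulation above and placeholder project/dataset/table ids:

from google.cloud import bigquery

client = bigquery.Client(project="xxx")  # placeholder project
table = client.get_table("xxx.dataset_name.table_name")  # placeholder table id

rows = [{"ed": "3", "date": "2017-10-11"}]  # DATE sent as a 'YYYY-MM-DD' string
errors = client.insert_rows_json(table, rows)
print(errors)  # an empty list means the rows were inserted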
