I'm trying to export data from a DynamoDB transaction table using Python. Until now I was able to get all the data from the table but I would like to add a filter that allows me to only get the data from a certain date until today.
There is a field called CreatedAt that indicates the time when the transaction was made, I was thinking of using this field to filter the new data.
This is the code I've been using to query the table, it would be really helpful if anyone can tell me how to apply this filter into this script.
import pandas as pd
import boto3
from boto3.dynamodb.conditions import Key, Attr
aws_access_key_id = '*****'
aws_secret_access_key = '*****'
region='****'
dynamodb = boto3.resource(
'dynamodb',
aws_access_key_id=aws_access_key_id,
aws_secret_access_key=aws_secret_access_key,
region_name=region
)
transactions_table = dynamodb.Table('transactions_table')
result = transactions_table.scan()
items = result['Items']
df_transactions_table = pd.json_normalize(items)
print(df_transactions_table)
Thanks!
Boto3 allows a FilterExpression as part of a DynamoDB Query or Scan, which will achieve the filtering on that field.
Note that using a FilterExpression still consumes the same amount of read capacity units, because the filter is applied after the items have been read.
You need to use FilterExpression, which would look like the following:
import boto3
import pandas as pd
from boto3.dynamodb.conditions import Attr
aws_access_key_id = '*****'
aws_secret_access_key = '*****'
region='****'
dynamodb = boto3.resource(
'dynamodb',
aws_access_key_id=aws_access_key_id,
aws_secret_access_key=aws_secret_access_key,
region_name=region
)
transactions_table = dynamodb.Table('transactions_table')
result = transactions_table.scan(
FilterExpression=Attr('CreatedAt').gt('2020-08-10'),
)
items = result['Items']
df_transactions_table = pd.json_normalize(items)
print(df_transactions_table)
You can learn more from the docs on Boto3 Scan and FilterExpression.
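Also note that Scan returns at most 1 MB of data per request, so for a full export you may need to follow LastEvaluatedKey across pages. A minimal sketch, reusing the table and filter from above:

filter_expr = Attr('CreatedAt').gt('2020-08-10')

# First page of results.
result = transactions_table.scan(FilterExpression=filter_expr)
items = result['Items']

# Keep scanning while DynamoDB reports more pages.
while 'LastEvaluatedKey' in result:
    result = transactions_table.scan(
        FilterExpression=filter_expr,
        ExclusiveStartKey=result['LastEvaluatedKey'],
    )
    items.extend(result['Items'])

df_transactions_table = pd.json_normalize(items)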
Some advice: please do not hard-code your keys the way you have done in this code; use an IAM role. If you are testing locally, configure the AWS CLI, which will provide credentials that you can assume when testing. That way you won't make a mistake and share keys on GitHub etc.
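For example, with the AWS CLI configured (aws configure) or an IAM role attached to the instance, boto3 resolves credentials from its default chain and you can drop the keys from the code entirely. A minimal sketch (the region is a placeholder):

import boto3

# No explicit keys: boto3 finds credentials in the environment,
# ~/.aws/credentials, or an attached IAM role.
dynamodb = boto3.resource('dynamodb', region_name='us-east-1')
transactions_table = dynamodb.Table('transactions_table')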
My requirement is to use a Python script to read data from an AWS Glue database into a dataframe. When I researched this, I found the library "awswrangler". I'm using the code below to connect and read the data:
import boto3
import awswrangler as wr
profile_name = 'aws_profile_dev'
REGION = 'us-east-1'
# Retrieving credentials to connect to AWS
ACCESS_KEY_ID, SECRET_ACCESS_KEY,SESSION_TOKEN = get_profile_credentials(profile_name)
session = boto3.session.Session(
aws_access_key_id=ACCESS_KEY_ID,
aws_secret_access_key=SECRET_ACCESS_KEY,
aws_session_token=SESSION_TOKEN
)
my_df= wr.athena.read_sql_table(table= 'mytable_1', database= 'shared_db', boto3_session=session)
However, when I'm running the above code, I'm getting the following error - "ValueError: year 0 is out of range"
Alternatively, I tried using another library - "pyathena". The code I'm trying to use is:
from pyathena import connect
import pandas as pd
conn = connect(aws_access_key_id=ACCESS_KEY_ID,
aws_secret_access_key=SECRET_ACCESS_KEY,
aws_session_token=SESSION_TOKEN,
s3_staging_dir='s3://my-sample-bucket/',
region_name='us-east-1')
df = pd.read_sql("select * from AwsDataCatalog.shared_db.mytable_1 limit 1000", conn)
Using this, I'm able to retrieve data, but it works only if I use a limit. That is, if I run the query without a limit, i.e. "select * from AwsDataCatalog.shared_db.mytable_1", it gives the error: ValueError: year 0 is out of range
Weird behavior - for example, if I run:
df = pd.read_sql("select * from AwsDataCatalog.shared_db.mytable_1 limit 1200", conn)
sometimes it gives the same error; if I simply reduce the limit value and run it (for example with limit 1199), and later run it again with limit 1200, it works. But this doesn't work if I try to read more than ~1300 rows. I have a total of 2002 rows in the table, and I need to read the entire table.
Please help! Thank you!
Use the following code in Python to get the data you are looking for.
import boto3
query = "SELECT * from table_name"
s3_resource = boto3.resource("s3")
s3_client = boto3.client('s3')
DATABASE = 'database_name'
output='s3://output-bucket/output-folder'
athena_client = boto3.client('athena')
# Execution
response = athena_client.start_query_execution(
QueryString=query,
QueryExecutionContext={
'Database': DATABASE
},
ResultConfiguration={
'OutputLocation': output,
}
)
queryId = response['QueryExecutionId']
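start_query_execution only kicks the query off; you still have to wait for it to finish and then read the output. A minimal sketch of polling the execution state and loading the CSV that Athena writes to the output location (the bucket and folder names follow the placeholders above):

import time
import pandas as pd

# Poll until the query reaches a terminal state.
while True:
    execution = athena_client.get_query_execution(QueryExecutionId=queryId)
    state = execution['QueryExecution']['Status']['State']
    if state in ('SUCCEEDED', 'FAILED', 'CANCELLED'):
        break
    time.sleep(2)

if state == 'SUCCEEDED':
    # Athena names the result file <QueryExecutionId>.csv in the output location.
    obj = s3_client.get_object(Bucket='output-bucket',
                               Key='output-folder/{}.csv'.format(queryId))
    df = pd.read_csv(obj['Body'])
    print(df.head())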
I have found a way, using awswrangler, to query data directly from Athena into a pandas dataframe on your local machine. This doesn't require us to provide an output location on S3.
import boto3
import awswrangler as wr

profile_name = 'Dev-AWS'
REGION = 'us-east-1'
#this automatically retrieves credentials from your aws credentials file after you run aws configure on command-line
ACCESS_KEY_ID, SECRET_ACCESS_KEY,SESSION_TOKEN = get_profile_credentials(profile_name)
session = boto3.session.Session(
aws_access_key_id=ACCESS_KEY_ID,
aws_secret_access_key=SECRET_ACCESS_KEY,
aws_session_token=SESSION_TOKEN
)
wr.athena.read_sql_query("select * from table_name", database="db_name", boto3_session=session)
Alternatively, if you don't want to query Athena but want to read an entire Glue table, you can use:
my_df = wr.athena.read_sql_table(table= 'my_table', database= 'my_db', boto3_session=session)
I'm trying to create a storage account in Azure and upload a blob into it using their python SDK.
I managed to create an account like this:
client = get_client_from_auth_file(StorageManagementClient)
storage_account = client.storage_accounts.create(
resourceGroup,
name,
StorageAccountCreateParameters(
sku=Sku(name=SkuName.standard_ragrs),
enable_https_traffic_only=True,
kind=Kind.storage,
location=region)).result()
The problem is that later, when I try to build a container, I don't know what to pass as "account_url".
I have tried doing:
client = get_client_from_auth_file(BlobServiceClient, account_url=storage_account.primary_endpoints.blob)
return client.create_container(name)
But I'm getting:
azure.core.exceptions.ResourceNotFoundError: The specified resource does not exist
I did manage to create a container using:
client = get_client_from_auth_file(StorageManagementClient)
return client.blob_containers.create(
resourceGroup,
storage_account.name,
name,
BlobContainer(),
public_access=PublicAccess.Container
)
But later, when I try to upload a blob using BlobServiceClient or BlobClient, I still need the "account_url", so I'm still getting an error:
azure.core.exceptions.ResourceNotFoundError: The specified resource does not exist
Can anyone help me understand how to get the account_url for a storage account I created with the SDK?
EDIT:
I managed to find a workaround to the problem by creating the connection string from the storage keys.
storage_client = get_client_from_auth_file(StorageManagementClient)
storage_keys = storage_client.storage_accounts.list_keys(resource_group, account_name)
storage_key = next(v.value for v in storage_keys.keys)
return BlobServiceClient.from_connection_string(
'DefaultEndpointsProtocol=https;' +
f'AccountName={account_name};' +
f'AccountKey={storage_key};' +
'EndpointSuffix=core.windows.net')
This works, but I think George Chen's answer is more elegant.
I could reproduce this problem. I found that get_client_from_auth_file does not pass the credential on to the BlobServiceClient; if you just create a BlobServiceClient with the account_url and no credential, it can still print the account name, which is why the error only shows up later.
So if you want to use a credential to get the BlobServiceClient, you could use the code below, then do the other operations.
from azure.identity import ClientSecretCredential
from azure.storage.blob import BlobServiceClient

credentials = ClientSecretCredential(
    'tenant_id',
    'application_id',
    'application_secret'
)
blobserviceclient = BlobServiceClient(account_url=storage_account.primary_endpoints.blob, credential=credentials)
If you don't want to do it this way, you could create the BlobServiceClient with the account key instead.
from azure.common.client_factory import get_client_from_auth_file
from azure.mgmt.storage import StorageManagementClient
from azure.mgmt.storage.models import StorageAccountCreateParameters, Sku, SkuName, Kind

client = get_client_from_auth_file(StorageManagementClient, auth_path='auth')
storage_account = client.storage_accounts.create(
'group name',
'account name',
StorageAccountCreateParameters(
sku=Sku(name=SkuName.standard_ragrs),
enable_https_traffic_only=True,
kind=Kind.storage,
location='eastus',)).result()
storage_keys = client.storage_accounts.list_keys(resource_group_name='group name',account_name='account name')
storage_keys = {v.key_name: v.value for v in storage_keys.keys}
blobserviceclient = BlobServiceClient(account_url=storage_account.primary_endpoints.blob, credential=storage_keys['key1'])
blobserviceclient.create_container(name='container name')
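For completeness, once you have the BlobServiceClient built either way, uploading a blob looks roughly like this (the container, blob, and file names are placeholders):

# create_container returns a ContainerClient for the new container.
container_client = blobserviceclient.create_container('my-container')

# Upload a local file as a blob.
with open('local_file.txt', 'rb') as data:
    container_client.upload_blob(name='my-blob.txt', data=data)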
My question is about code to extract a table from BigQuery and save it as a JSON file.
I made my code mostly by following the gcloud tutorials in their documentation.
I couldn't set my credentials implicitly, so I did it explicitly, pointing to my JSON file. But it seems that it doesn't quite get the "Client" object from the path I took.
If anyone could clarify how this whole implicit and explicit credential business works, that would help me a lot too!
I am using Python 2.7 and PyCharm. The code is as follows:
from gcloud import bigquery
from google.cloud import storage
def bigquery_get_rows ():
json_key = "path/to/my/json_file.json"
storage_client = storage.Client.from_service_account_json(json_key)
    print("\nGot the client\n")
# Make an authenticated API request
buckets = list(storage_client.list_buckets())
print(buckets)
print(storage_client)
    # Setting up the environment
bucket_name = 'my_bucket/name'
print(bucket_name)
destination_uri = 'gs://{}/{}'.format(bucket_name, 'my_table_json_name.json')
print(destination_uri)
#dataset_ref = client.dataset('samples', project='my_project_name')
dataset_ref = storage_client.dataset('my_dataset_name', project='my_project_id')
print(dataset_ref)
table_ref = dataset_ref.table('my_table_to_be_extracted_name')
print(table_ref)
job_config = bigquery.job.ExtractJobConfig()
job_config.destination_format = (
bigquery.DestinationFormat.NEWLINE_DELIMITED_JSON)
extract_job = client.extract_table(
table_ref, destination_uri, job_config=job_config) # API request
extract_job.result() # Waits for job to complete.
bigquery_get_rows()
You are using the wrong client object: you are trying to use the Cloud Storage (GCS) client to work with BigQuery.
Instead of
dataset_ref = storage_client.dataset('my_dataset_name', project='my_project_id')
it should be:
bq_client = bigquery.Client.from_service_account_json(
'path/to/service_account.json')
dataset_ref = bq_client.dataset('my_dataset_name', project='my_project_id')
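Note that the later client.extract_table call needs to use the BigQuery client as well. Putting it together, a rough sketch of the corrected extract (the dataset, table, and bucket names are the placeholders from the question):

from google.cloud import bigquery

bq_client = bigquery.Client.from_service_account_json('path/to/service_account.json')

dataset_ref = bq_client.dataset('my_dataset_name', project='my_project_id')
table_ref = dataset_ref.table('my_table_to_be_extracted_name')

job_config = bigquery.job.ExtractJobConfig()
job_config.destination_format = bigquery.DestinationFormat.NEWLINE_DELIMITED_JSON

destination_uri = 'gs://{}/{}'.format('my_bucket_name', 'my_table_json_name.json')

# Start the extract job and wait for it to finish.
extract_job = bq_client.extract_table(table_ref, destination_uri, job_config=job_config)
extract_job.result()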
This is the query that I have been running in BigQuery and that I want to run in my Python script. How would I change this, or what would I have to add, for it to run in Python?
#standardSQL
SELECT
Serial,
MAX(createdAt) AS Latest_Use,
SUM(ConnectionTime/3600) as Total_Hours,
COUNT(DISTINCT DeviceID) AS Devices_Connected
FROM `dataworks-356fa.FirebaseArchive.testf`
WHERE Model = "BlueBox-pH"
GROUP BY Serial
ORDER BY Serial
LIMIT 1000;
From what I have been researching, it seems that I can't save this query as a permanent table using Python. Is that true? And if it is, is it still possible to export a temporary table?
You need to use the BigQuery Python client library; then something like this should get you up and running:
from google.cloud import bigquery
client = bigquery.Client(project='PROJECT_ID')
query = "SELECT...."
dataset = client.dataset('dataset')
table = dataset.table(name='table')
job = client.run_async_query('my-job', query)
job.destination = table
job.write_disposition= 'WRITE_TRUNCATE'
job.begin()
https://googlecloudplatform.github.io/google-cloud-python/stable/bigquery-usage.html
See the current BigQuery Python client tutorial.
Here is another way using a JSON file for the service account:
>>> from google.cloud import bigquery
>>>
>>> CREDS = 'test_service_account.json'
>>> client = bigquery.Client.from_service_account_json(json_credentials_path=CREDS)
>>> job = client.query('select * from dataset1.mytable')
>>> for row in job.result():
... print(row)
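If you want the results in a pandas dataframe instead of iterating over rows, the result iterator can convert directly (assuming pandas is installed):

>>> df = job.result().to_dataframe()
>>> df.head()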
This is a good usage guide:
https://googleapis.github.io/google-cloud-python/latest/bigquery/usage/index.html
To simply run and write a query:
# from google.cloud import bigquery
# client = bigquery.Client()
# dataset_id = 'your_dataset_id'
job_config = bigquery.QueryJobConfig()
# Set the destination table
table_ref = client.dataset(dataset_id).table("your_table_id")
job_config.destination = table_ref
sql = """
SELECT corpus
FROM `bigquery-public-data.samples.shakespeare`
GROUP BY corpus;
"""
# Start the query, passing in the extra configuration.
query_job = client.query(
sql,
# Location must match that of the dataset(s) referenced in the query
# and of the destination table.
location="US",
job_config=job_config,
) # API request - starts the query
query_job.result() # Waits for the query to finish
print("Query results loaded to table {}".format(table_ref.path))
I personally prefer querying using pandas:
# BQ authentication
import pandas as pd
import pydata_google_auth
SCOPES = [
'https://www.googleapis.com/auth/cloud-platform',
'https://www.googleapis.com/auth/drive',
]
credentials = pydata_google_auth.get_user_credentials(
SCOPES,
    # Set auth_local_webserver to True to have a slightly more convenient
    # authorization flow. Note: this doesn't work if you're running from a
    # notebook on a remote server, such as over SSH or with Google Colab.
auth_local_webserver=True,
)
query = "SELECT * FROM my_table"
data = pd.read_gbq(query, project_id = MY_PROJECT_ID, credentials=credentials, dialect = 'standard')
The pythonbq package is very simple to use and a great place to start. It uses python-gbq.
To get started you would need to generate a BQ json key for external app access. You can generate your key here.
Your code would look something like:
from pythonbq import pythonbq
myProject=pythonbq(
bq_key_path='path/to/bq/key.json',
project_id='myGoogleProjectID'
)
SQL_CODE="""
SELECT
Serial,
MAX(createdAt) AS Latest_Use,
SUM(ConnectionTime/3600) as Total_Hours,
COUNT(DISTINCT DeviceID) AS Devices_Connected
FROM `dataworks-356fa.FirebaseArchive.testf`
WHERE Model = "BlueBox-pH"
GROUP BY Serial
ORDER BY Serial
LIMIT 1000;
"""
output=myProject.query(sql=SQL_CODE)
I'm trying to store the user's aws_access_key_id and aws_secret_access_key keys in the ~/.boto file.
I'm already storing aws_access_key_id correctly, but I don't know how to get aws_secret_access_key so that I can store it in the ~/.boto file as well.
Do you know how I can get aws_secret_access_key?
import os
import boto.iam.connection
username = "user"
conection = boto.iam.connect_to_region("us-east-1")
conection.create_access_key(username)
k = conection.get_all_access_keys(username)
ackey = k['list_access_keys_response']['list_access_keys_result']['access_key_metadata'][0]['access_key_id']
# and how to return the aws_secret_access_key??
with open(os.path.expanduser("~/.boto"),"w") as f:
    f.write("[Credentials]")
    f.write("\n")
    f.write("aws_access_key_id = " + ackey)
    f.write("\n")
    f.write("aws_secret_access_key = " + ??)
The secret access key associated with AWS API credentials is returned via the API only when the access key is created. You must store it at that point, because the IAM service never returns it again. If you change your code to something like the following, you can capture the secret key at creation time:
connection = boto.iam.connect_to_region("us-east-1")
response = connection.create_access_key(username)
secret_access_key = response['create_access_key_response']['create_access_key_result']['access_key']['secret_access_key']
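So a sketch of the full flow (create the key, take both values from the same response, and write them to ~/.boto) might look like this; the secret_access_key path follows the structure above, while the access_key_id field name assumes boto's usual snake_case conversion of the response:

import os
import boto.iam.connection

username = "user"
connection = boto.iam.connect_to_region("us-east-1")

# Both the access key id and the secret are in the create_access_key response.
response = connection.create_access_key(username)
access_key = response['create_access_key_response']['create_access_key_result']['access_key']
access_key_id = access_key['access_key_id']  # assumed field name, mirroring secret_access_key
secret_access_key = access_key['secret_access_key']

with open(os.path.expanduser("~/.boto"), "w") as f:
    f.write("[Credentials]\n")
    f.write("aws_access_key_id = " + access_key_id + "\n")
    f.write("aws_secret_access_key = " + secret_access_key + "\n")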