I have only been working with AWS DynamoDB for a short time. I am wondering how I can get the same result as this statement (without a WHERE clause):
SELECT column1 FROM DynamoTable;
I tried (but failed) with:
import boto3
dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('DynamoTable')
from boto3.dynamodb.conditions import Key, Attr
resp = table.query(KeyConditionExpression=Key('column1'))
It requires Key().eq() or Key().begins_with() ...
I already tried resp = table.scan(), but the response contains too many fields when all I need is column1.
Thanks.
This lets you get the required column directly, so you do not need to iterate over the whole result set yourself:
import boto3

def getColumn1Items():
    dynamodb = boto3.resource('dynamodb')
    table = dynamodb.Table('DynamoTable')
    resp = table.scan(AttributesToGet=['column1'])
    return resp['Items']
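Note that AttributesToGet is a legacy parameter; ProjectionExpression is its current replacement. Also, a single Scan call returns at most 1 MB of data, so for larger tables you need to paginate. A minimal sketch of both, assuming the same DynamoTable:

import boto3

def get_all_column1_items():
    # Paginate with LastEvaluatedKey until the whole table has been scanned.
    dynamodb = boto3.resource('dynamodb')
    table = dynamodb.Table('DynamoTable')
    kwargs = {'ProjectionExpression': 'column1'}
    items = []
    while True:
        resp = table.scan(**kwargs)
        items.extend(resp['Items'])
        if 'LastEvaluatedKey' not in resp:
            break
        kwargs['ExclusiveStartKey'] = resp['LastEvaluatedKey']
    return items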
You should definitely use the Scan operation. Check the documentation to see how to implement it in Python.
Regarding how to select just a specific attribute, you could use:
import boto3

def getColumn1Items():
    dynamodb = boto3.resource('dynamodb')
    table = dynamodb.Table('DynamoTable')
    response = table.scan()
    return [i['column1'] for i in response['Items']]
You have to iterate over the entire table and pick out just the column you need.
Related
I have a table called 'DATA' in DynamoDB with 20 to 25 columns, but I need to pull only 3 of them.
Required columns are status, ticket_id and country
table_name = 'DATA'
# dynamodb client
dynamodb_client = boto3.client('dynamodb')
I'm able to achieve this using scan, as shown below, but I want to do the same using the query method.
response = table.scan(AttributesToGet=['ticket_id','ticket_status'])
I tried the code below with the query method, but I'm getting an error.
response = table.query(ProjectionExpression=['ticket_id','ticket_status']),keyConditionExpression('opco_type').eq('cwc') or keyConditionExpression('opco_type').eq('cwp'))
Is there any way of getting only required columns from dynamo?
As already commented, you need to use ProjectionExpression:
dynamodb = boto3.resource('dynamodb', region_name=region)
table = dynamodb.Table(table_name)
item = table.get_item(Key={'Title': 'Scarface', 'Year': 1983},
                      ProjectionExpression='status, ticket_id, country')
Some things to note:
It is better to use resource instead of client; this avoids the low-level DynamoDB JSON syntax.
You need to pass the full (composite) key to get_item.
Selected columns should be in a comma-separated string.
It is a good idea to always use expression attribute names; in fact it is required here, since status is a reserved word in DynamoDB:
item = table.get_item(Key={'Title': 'Scarface', 'Year': 1983},
                      ProjectionExpression='#status, ticket_id, country',
                      ExpressionAttributeNames={'#status': 'status'})
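The same projection works with the query method the asker wanted. A sketch, assuming opco_type is the table's partition key as in the question's attempt (note that a single query can match only one partition key value, so 'cwc' and 'cwp' would need two separate queries):

from boto3.dynamodb.conditions import Key

# Query one partition, projecting only the three required attributes.
response = table.query(
    KeyConditionExpression=Key('opco_type').eq('cwc'),
    ProjectionExpression='#status, ticket_id, country',
    ExpressionAttributeNames={'#status': 'status'}
)
items = response['Items']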
I need to download a relatively small table from BigQuery and store it (after some parsing) in a Pandas dataframe.
Here is the relevant sample of my code:
from google.cloud import bigquery
import pandas as pd

client = bigquery.Client(project="project_id")
job_config = bigquery.QueryJobConfig(allow_large_results=True)
query_job = client.query("my sql string", job_config=job_config)
result = query_job.result()
rows = [dict(row) for row in result]
pdf = pd.DataFrame.from_dict(rows)
My problem:
After a few thousand rows are parsed, one of them is too big and I get an exception: google.api_core.exceptions.Forbidden.
So, after a few iterations, I tried to transform my loop into something like:
rows = list()
for _ in range(result.total_rows):
    try:
        rows.append(dict(next(result)))
    except google.api_core.exceptions.Forbidden:
        pass
BUT it doesn't work, since result is a bigquery.table.RowIterator and, despite its name, it's not an iterator... it's an iterable.
So... what do I do now? Is there a way to either:
ask for the next row in a try/except scope?
tell bigquery to skip bad rows?
Did you try paging through query results?
from google.cloud import bigquery
# Construct a BigQuery client object.
client = bigquery.Client()
query = """
SELECT name, SUM(number) as total_people
FROM `bigquery-public-data.usa_names.usa_1910_2013`
GROUP BY name
ORDER BY total_people DESC
"""
query_job = client.query(query) # Make an API request.
query_job.result() # Wait for the query to complete.
# Get the destination table for the query results.
#
# All queries write to a destination table. If a destination table is not
# specified, the BigQuery populates it with a reference to a temporary
# anonymous table after the query completes.
destination = query_job.destination
# Get the schema (and other properties) for the destination table.
#
# A schema is useful for converting from BigQuery types to Python types.
destination = client.get_table(destination)
# Download rows.
#
# The client library automatically handles pagination.
print("The query data:")
rows = client.list_rows(destination, max_results=20)
for row in rows:
    print("name={}, count={}".format(row["name"], row["total_people"]))
Also, you can try to filter out big rows in your query:
WHERE LENGTH(some_field) < 123
or
WHERE LENGTH(CAST(some_field AS BYTES)) < 123
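If you still want the try/except approach from the question, any iterable can be turned into a true iterator with the built-in iter(). A sketch of that idea (untested; whether the Forbidden error is actually recoverable row by row is unclear, since it may be raised while fetching a whole page of results):

import google.api_core.exceptions

rows = []
it = iter(result)  # result is the RowIterator from query_job.result()
while True:
    try:
        rows.append(dict(next(it)))
    except StopIteration:
        break  # no more rows
    except google.api_core.exceptions.Forbidden:
        pass   # skip the offending row and keep going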
I am trying to get the list of tables and their last_modified_date using the BigQuery REST API.
In the BigQuery API explorer I am getting all the fields correctly, but when I use the API from Python code it returns 'None' for the modified date.
This is the code I have written in Python:
from google.cloud import bigquery

client = bigquery.Client(project='temp')
datasets = list(client.list_datasets())
for dataset in datasets:
    print dataset.dataset_id

for dataset in datasets:
    for table in dataset.list_tables():
        print table.table_id
        print table.created
        print table.modified
In this code I am getting created date correctly but modified date is 'None' for all the tables.
Not quite sure which version of the API you are using, but I suspect the latest versions no longer have the method dataset.list_tables().
Still, this is one way of getting the last modified field; see if this works for you (or gives you some idea of how to get this data):
from google.cloud import bigquery

client = bigquery.Client.from_service_account_json('/key.json')
dataset_list = list(client.list_datasets())
for dataset_item in dataset_list:
    dataset = client.get_dataset(dataset_item.reference)
    tables_list = list(client.list_tables(dataset))
    for table_item in tables_list:
        table = client.get_table(table_item.reference)
        print "Table {} last modified: {}".format(
            table.table_id, table.modified)
If you want to get the last modified time from only one table:
from google.cloud import bigquery

def get_last_bq_update(project, dataset, table_name):
    client = bigquery.Client.from_service_account_json('/key.json')
    table_id = f"{project}.{dataset}.{table_name}"
    table = client.get_table(table_id)
    print(table.modified)
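For example, with hypothetical identifiers:

get_last_bq_update('my-project', 'my_dataset', 'my_table')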
I am using Lambda (Python) to query my DynamoDB database. I am using the boto3 library, and I was able to make an "equivalent" query:
This script works:
import boto3
from boto3.dynamodb.conditions import Key, Attr
import json

def create_list(event, context):
    resource = boto3.resource('dynamodb')
    table = resource.Table('Table_Name')
    response = table.query(
        TableName='Table_Name',
        IndexName='Custom-Index-Name',
        KeyConditionExpression=Key('Number_Attribute').eq(0)
    )
    return response
However, when I change the query expression to this:
KeyConditionExpression=Key('Number_Attribute').gt(0)
I get the error:
"errorType": "ClientError",
"errorMessage": "An error occurred (ValidationException) when calling the Query operation: Query key condition not supported"
According to this [1] resource, "gt" is a method of Key(). Does anyone know if this library has been updated, or what other methods are available other than "eq"?
[1] http://boto3.readthedocs.io/en/latest/reference/customizations/dynamodb.html#ref-dynamodb-conditions
---------EDIT----------
I also just tried the old method using:
response = client.query(
    TableName='Table_Name',
    IndexName='Custom_Index',
    KeyConditions={
        'Custom_Number_Attribute': {
            'ComparisonOperator': 'EQ',
            'AttributeValueList': [{'N': '0'}]
        }
    }
)
This worked, but when I try:
response = client.query(
    TableName='Table_Name',
    IndexName='Custom_Index',
    KeyConditions={
        'Custom_Number_Attribute': {
            'ComparisonOperator': 'GT',
            'AttributeValueList': [{'N': '0'}]
        }
    }
)
...it does not work.
Why would EQ be the only method working in these cases? I'm not sure what I'm missing in the documentation.
From what I can tell:
Your partition key is Number_Attribute, so you cannot do a gt in a query (you can do an eq and that is it).
You can do a gt or between on your sort key in a query. It is also called a range key, and because DynamoDB stores items with the same partition key next to each other, sorted by this key, it can evaluate gt and between efficiently in a query, as shown in the sketch below.
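A hedged sketch of that, assuming a hypothetical index whose partition key is Category_Attribute and whose sort key is Number_Attribute (names invented for illustration; table is the Table resource from earlier):

from boto3.dynamodb.conditions import Key

# eq on the partition key is mandatory; gt is allowed only on the sort key.
response = table.query(
    IndexName='Custom-Index-Name',
    KeyConditionExpression=Key('Category_Attribute').eq('some-value') &
                           Key('Number_Attribute').gt(0)
)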
Now, if you want to do a gt or between on your partition key, then you will have to use scan, like the below:

fe = Key('Number_Attribute').gt(0)
response = table.scan(
    FilterExpression=fe
)
Keep in mind of the following concerning scan:
The scan method reads every item in the entire table, and returns all of the data in the table. You can provide an optional filter_expression, so that only the items matching your criteria are returned. However, note that the filter is only applied after the entire table has been scanned.
So in other words, it's a bit of a costly operation compared to query. You can see an example in the documentation here.
Hope that helps!
I'm fairly new to NoSQL and using AWS DynamoDB. I'm calling it from AWS Lambda using python 2.7
I'm trying to retrieve a value from an order_number field.
This is what my table looks like (I only have one record):
primary partition key: subscription_id
and my secondary global index: order_number
Is my setup correct?
If so given the order_number how do I retrieve the record using python?
I can't figure out the syntax to do it.
I've tried
response = table.get_item(Key={'order_number': myordernumber})
But I get:
An error occurred (ValidationException) when calling the GetItem operation: The provided key element does not match the schema: ClientError
DynamoDB does not automatically index all of the fields of your object. By default you can define a hash key (subscription_id in your case) and, optionally, a range key and those will be indexed. So, you could do this:
response = table.get_item(Key={'subscription_id': mysubid})
and it will work as expected. However, if you want to retrieve an item based on order_number, you would have to use a scan operation, which looks through all items in your table to find the one(s) with the correct value. This is a very expensive operation. Or you could create a Global Secondary Index on your table that uses order_number as the primary key. If you did that, and called the new index order_number-index, you could then query for objects that match a specific order number like this:
from boto3.dynamodb.conditions import Key, Attr
response = table.query(
    IndexName='order_number-index',
    KeyConditionExpression=Key('order_number').eq(myordernumber))
DynamoDB is a very fast, scalable, and efficient database, but it does require a lot of thought about which fields you might want to search on and how to do that efficiently.
The good news is that now you can add GSI's to an existing table. Previously you would have had to delete your table and start all over again.
Make sure you've imported this:
from boto3.dynamodb.conditions import Key, Attr
If you don't have it, you'll get the error for sure. It's in the documentation examples.
Thanks @altoids for the comment above, as this is the correct answer for me. I wanted to bring attention to it with a "formal" answer.
To query DynamoDB using an index with a filter:
import boto3
from boto3.dynamodb.conditions import Key, Attr

dynamodb = boto3.resource('dynamodb', region_name=region)
table = dynamodb.Table('<TableName>')

response = table.query(
    IndexName='<Index>',
    KeyConditionExpression=Key('<key1>').eq('<value>') & Key('<key2>').eq('<value>'),
    FilterExpression=Attr('<attr>').eq('<value>')
)

print(response['Items'])
If a filter is not required, don't pass FilterExpression to query.
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource('dynamodb', region_name=region_name)
table = dynamodb.Table(tableName)

def queryDynamo(pk, sk):
    response = table.query(
        ProjectionExpression="#pk, #sk, keyA, keyB",
        ExpressionAttributeNames={"#pk": "pk", "#sk": "sk"},
        KeyConditionExpression=Key('pk').eq(pk) & Key('sk').eq(sk)
    )
    return response['Items']
If you use the boto3 dynamodb client, you can do the following (again you would need to use subscription_id as that is the primary key):
dynamodb = boto3.client('dynamodb')

response = dynamodb.query(
    TableName='recurring_charges',
    KeyConditionExpression="subscription_id = :subscription_id",
    ExpressionAttributeValues={":subscription_id": {"S": "id"}}
)
So far, this is the cleanest way I've discovered; the query is in JSON format.
dynamodb_client = boto3.client('dynamodb')

def query_items():
    arguments = {
        "TableName": "your_dynamodb_table",
        "IndexName": "order_number-index",
        "KeyConditionExpression": "order_number = :V1",
        "ExpressionAttributeValues": {":V1": {"S": "value"}},
    }
    return dynamodb_client.query(**arguments)
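A hypothetical call, reading the matched items from the response dict:

items = query_items()['Items']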