In Google BigQuery, how to use a JavaScript UDF with the Google Python client

I am writing a query in BigQuery using standard SQL and a JavaScript UDF. I am able to implement this using the web UI and the BigQuery command-line tool, but my requirement is to run this query using the Google Python client. I am not able to achieve this. Can someone please help?
from google.cloud import bigquery
bigquery_client = bigquery.Client()
client = bigquery.Client()
query_results = client.run_sync_query("""
CREATE TEMPORARY FUNCTION CategoriesToNumerical(a array<STRING>, b array<STRING>)
RETURNS string
LANGUAGE js AS """
var values = {};
var counter = 0;
for(i=0;i<a.length;i++)
{
    var temp;
    temp = a[i];
    a[i] = counter;
    values[temp] = counter;
    counter++;
}
for(i=0;i<b.length;i++)
{
    for(var key in values)
    {
        if(b[i] == key)
        {
            b[i] = values[key];
        }
    }
}
return b;
""";
SELECT
CategoriesToNumerical(ARRAY(SELECT DISTINCT ProspectStage from lsq.lsq_dest), ARRAY(SELECT ProspectStage from lsq.lsq_dest)) as prospectstageds
;""")
query_results.use_legacy_sql = False
query_results.run()
page_token = None
while True:
    rows1, total_rows, page_token = query_results.fetch_data(
        max_results=100,
        page_token=page_token)
    for row1 in rows1:
        print "row", row1
    if not page_token:
        break
This is not working for me. Can someone please help with how I should go about it?

The problem seems to be that you have two conflicting sets of triple double quotes (""") here. Replace the outer set with triple single quotes (''') and the code should work.
So instead of
query_results = client.run_sync_query("""
CREATE TEMPORARY FUNCTION CategoriesToNumerical(a array<STRING>,b array<STRING>)
RETURNS string
LANGUAGE js AS """
javascript code
"""
SELECT *
FROM
"""
write
query_results = client.run_sync_query('''
CREATE TEMPORARY FUNCTION CategoriesToNumerical(a array<STRING>,b array<STRING>)
RETURNS string
LANGUAGE js AS """
javascript code
"""
SELECT *
FROM
'''
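For reference, here is a minimal sketch of the asker's full call with only the outer delimiters changed. It keeps run_sync_query from the older google-cloud-bigquery API used in the question (newer versions of the library expose client.query() instead), and the JavaScript body is just a placeholder that returns a STRING to match the declared return type:

from google.cloud import bigquery

client = bigquery.Client()
query_results = client.run_sync_query('''
CREATE TEMPORARY FUNCTION CategoriesToNumerical(a ARRAY<STRING>, b ARRAY<STRING>)
RETURNS STRING
LANGUAGE js AS """
// placeholder JavaScript body; return a STRING to match the declared return type
return b.join(",");
""";
SELECT
  CategoriesToNumerical(ARRAY(SELECT DISTINCT ProspectStage FROM lsq.lsq_dest),
                        ARRAY(SELECT ProspectStage FROM lsq.lsq_dest)) AS prospectstageds;
''')
query_results.use_legacy_sql = False
query_results.run()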

Related

Problem with continuation token when querying from Cosmos DB

I'm facing a problem with continuation when querying items from CosmosDB.
I've already tried the following solution but with no success. I'm only able to query the first 10 results of a page even though I get a token that is not NULL.
The token has a size of 10733 bytes and looks like this.
{"token":"+RID:gtQwAJ9KbavOAAAAAAAAAA==#RT:1#TRC:10#FPP:AggAAAAAAAAAAGoAAAAAKAAAAAAAAAAAAADCBc6AEoAGgAqADoASgAaACoAOgBKABoAKgA6AE4AHgAuAD4ASgAeACoAPgBOAB4ALgA+AE4AHgAqAD4ASgAeAC4APgBOAB4ALgA+AE4AIgA2AEYAFgAmADYARgAaACYAPgBKABYAKgA6AE4AHgAuAD4ATgAeAC4APgBOAB4ALgA+AE4AIgAuAD4ATgAeAC4APgBOACIAMgA+AFIAIgAyAD4AUgAmADIAQgAWACIALgBCABIAIgAyAEIAEgAiADIAQgAOACYANgBKAB4AJgA6AEYAGgAqADoATgAeAC4APgB....etc...etc","range":{"min":"","max":"05C1BF3FB3CFC0"}}
The code looks like this. The function QueryDocuments did not work; instead I had to use QueryItems.
options = {}
options['enableCrossPartitionQuery'] = True
options['maxItemCount'] = 10
q = client.QueryItems(collection_link, query, options)
results_1 = q._fetch_function(options)
#this is a string representing a JSON object
token = results_1[1]['x-ms-continuation']
data = list(q._fetch_function({'maxItemCount':10,'enableCrossPartitionQuery':True, 'continuation':token}))
Is there a solution to this? Thanks for your help.
Please use the pydocumentdb package and refer to the sample code below.
from pydocumentdb import document_client
endpoint = "https://***.documents.azure.com:443/";
primaryKey = "***";
client = document_client.DocumentClient(endpoint, {'masterKey': primaryKey})
collection_link = "dbs/db/colls/coll"
query = "select c.id from c"
query_with_optional_parameters = [];
q = client.QueryDocuments(collection_link, query, {'maxItemCount': 2})
results_1 = q._fetch_function({'maxItemCount': 2})
print(results_1)
token = results_1[1]['x-ms-continuation']
results_2 = q._fetch_function({'maxItemCount': 2, 'continuation': token})
print(results_2)
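To page through all results rather than just the first batch, a hedged sketch building on the same private _fetch_function helper (placeholders as in the sample above) can loop until no x-ms-continuation header is returned:

from pydocumentdb import document_client

endpoint = "https://***.documents.azure.com:443/"
primaryKey = "***"
client = document_client.DocumentClient(endpoint, {'masterKey': primaryKey})
collection_link = "dbs/db/colls/coll"
query = "select c.id from c"

q = client.QueryDocuments(collection_link, query, {'maxItemCount': 10})
all_items = []
options = {'maxItemCount': 10, 'enableCrossPartitionQuery': True}
while True:
    items, headers = q._fetch_function(options)  # returns (documents, response headers)
    all_items.extend(items)
    token = headers.get('x-ms-continuation')
    if not token:  # no continuation header means the last page has been reached
        break
    options = {'maxItemCount': 10, 'enableCrossPartitionQuery': True, 'continuation': token}
print(len(all_items))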

Cosmos DB - Delete Document with Python

In this SO question I learned that I cannot delete a Cosmos DB document using SQL.
Using Python, I believe I need the DeleteDocument() method. This is how I'm getting the document IDs that are (I believe) required to then call the DeleteDocument() method.
# set up the client
client = document_client.DocumentClient()
# use a SQL based query to get a bunch of documents
query = { 'query': 'SELECT * FROM server s' }
result_iterable = client.QueryDocuments('dbs/DB/colls/coll', query, options)
results = list(result_iterable);
for x in range(0, len(results)):
    docID = results[x]['id']
Now, at this stage I want to call DeleteDocument().
Its inputs are document_link and options.
I can define document_link as something like
document_link = 'dbs/DB/colls/coll/docs/'+docID
I can successfully call ReadAttachments(), for example, which takes the same inputs as DeleteDocument().
When I call DeleteDocument(), however, I get an error...
The partition key supplied in x-ms-partitionkey header has fewer
components than defined in the the collection
...and now I'm totally lost
UPDATE
Following on from Jay's help, I believe I'm missing the partitionKey element in the options.
In this example, I've created a testing database.
So I think my partition key is /testPART.
When I include the partitionKey in the options however, no results are returned, (and so print len(results) outputs 0).
Removing partitionKey means that results are returned, but the delete attempt fails as before.
# Query them in SQL
query = { 'query': 'SELECT * FROM c' }
options = {}
options['enableCrossPartitionQuery'] = True
options['maxItemCount'] = 2
options['partitionKey'] = '/testPART'
result_iterable = client.QueryDocuments('dbs/testDB/colls/testCOLL', query, options)
results = list(result_iterable)
# should be > 0
print len(results)
for x in range(0, len(results)):
    docID = results[x]['id']
    print docID
    client.DeleteDocument('dbs/testDB/colls/testCOLL/docs/'+docID, options=options)
    print 'deleted', docID
Based on your description, I tried to use the pydocumentdb module to delete a document in my Azure DocumentDB, and it works for me.
Here is my code:
import pydocumentdb;
import pydocumentdb.document_client as document_client
config = {
    'ENDPOINT': 'Your url',
    'MASTERKEY': 'Your master key',
    'DOCUMENTDB_DATABASE': 'familydb',
    'DOCUMENTDB_COLLECTION': 'familycoll'
};
# Initialize the Python DocumentDB client
client = document_client.DocumentClient(config['ENDPOINT'], {'masterKey': config['MASTERKEY']})
# use a SQL based query to get a bunch of documents
query = { 'query': 'SELECT * FROM server s' }
options = {}
options['enableCrossPartitionQuery'] = True
options['maxItemCount'] = 2
result_iterable = client.QueryDocuments('dbs/familydb/colls/familycoll', query, options)
results = list(result_iterable);
print(results)
client.DeleteDocument('dbs/familydb/colls/familycoll/docs/id1',options)
print 'delete success'
Console Result:
[{u'_self': u'dbs/hitPAA==/colls/hitPAL3OLgA=/docs/hitPAL3OLgABAAAAAAAAAA==/', u'myJsonArray': [{u'subId': u'sub1', u'val': u'value1'}, {u'subId': u'sub2', u'val': u'value2'}], u'_ts': 1507687788, u'_rid': u'hitPAL3OLgABAAAAAAAAAA==', u'_attachments': u'attachments/', u'_etag': u'"00002100-0000-0000-0000-59dd7d6c0000"', u'id': u'id1'}, {u'_self': u'dbs/hitPAA==/colls/hitPAL3OLgA=/docs/hitPAL3OLgACAAAAAAAAAA==/', u'myJsonArray': [{u'subId': u'sub3', u'val': u'value3'}, {u'subId': u'sub4', u'val': u'value4'}], u'_ts': 1507687809, u'_rid': u'hitPAL3OLgACAAAAAAAAAA==', u'_attachments': u'attachments/', u'_etag': u'"00002200-0000-0000-0000-59dd7d810000"', u'id': u'id2'}]
delete success
Please notice that you need to set the enableCrossPartitionQuery property to True in options if your documents are cross-partitioned.
Must be set to true for any query that requires to be executed across
more than one partition. This is an explicit flag to enable you to
make conscious performance tradeoffs during development time.
You can find the above description here.
Update Answer:
I think you misunderstand the meaning of the partitionKey property in the options.
For example, my container is created with the partition key path /name, and my documents are as below:
{
    "id": "1",
    "name": "jay"
}
{
    "id": "2",
    "name": "jay2"
}
My partition key is 'name', so here I have two partitions: 'jay' and 'jay2'.
So you should set the partitionKey property to 'jay' or 'jay2', not 'name'.
Please modify your code as below:
options = {}
options['enableCrossPartitionQuery'] = True
options['maxItemCount'] = 2
options['partitionKey'] = 'jay'  # change this value to match your own data
result_iterable = client.QueryDocuments('dbs/db/colls/testcoll', query, options)
results = list(result_iterable);
print(results)
Hope it helps you.
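Putting the query and the delete together, a hedged sketch for the example container above (partition key path /name, so each document's own name value is passed as partitionKey) might look like this:

query = {'query': 'SELECT * FROM c'}
options = {'enableCrossPartitionQuery': True}
docs = list(client.QueryDocuments('dbs/db/colls/testcoll', query, options))
for doc in docs:
    doc_link = 'dbs/db/colls/testcoll/docs/' + doc['id']
    # partitionKey must be the document's partition key value (e.g. 'jay'), not the path
    client.DeleteDocument(doc_link, {'partitionKey': doc['name']})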
Using the azure.cosmos library:
Install and import the azure-cosmos package:
from azure.cosmos import exceptions, CosmosClient, PartitionKey
Define a delete-items function, in this case using the partition key in the query:
def deleteItems(deviceid):
    client = CosmosClient(config.cosmos.endpoint, config.cosmos.primarykey)
    # Create the database if it does not exist
    database = client.create_database_if_not_exists(id='azure-cosmos-db-name')
    # Create the container
    # Using a good partition key improves the performance of database operations.
    container = database.create_container_if_not_exists(
        id='container-name',
        partition_key=PartitionKey(path='/your-partition-path'),
        offer_throughput=400)
    # Fetch the items to delete
    query = f"SELECT * FROM c WHERE c.device.deviceid IN ('{deviceid}')"
    items = list(container.query_items(query=query, enable_cross_partition_query=False))
    for item in items:
        # The second argument is the item's partition key value
        container.delete_item(item, 'partition-key')
Usage:
deviceid = 10
deleteItems(deviceid)
Full example on GitHub: https://github.com/eladtpro/python-iothub-cosmos

Graphene/Django (GraphQL): How to use a query argument in order to exclude nodes matching a specific filter?

I have some video items in a Django/Graphene backend setup. Each video item is linked to one owner.
In a React app, I would like to query via GraphQL all the videos owned by the current user on the one hand and all the videos NOT owned by the current user on the other hand.
I could run the following GraphQL query and filter on the client side:
query AllScenes {
  allScenes {
    edges {
      node {
        id,
        name,
        owner {
          name
        }
      }
    }
  }
}
I would rather have two queries with filter parameters that ask my backend directly for the relevant data. Something like:
query AllScenes($ownerName: String!, $exclude: Boolean!) {
  allScenes(owner__name: $ownerName, exclude: $exclude) {
    edges {
      node {
        id,
        name,
        owner {
          name
        }
      }
    }
  }
}
I would query with ownerName = currentUserName and exclude = True/False, yet I just cannot retrieve my exclude argument on the backend side. Here is the code I have tried in my schema.py file:
from project.scene_manager.models import Scene
from graphene import ObjectType, relay, Int, String, Field, Boolean, Float
from graphene.contrib.django.filter import DjangoFilterConnectionField
from graphene.contrib.django.types import DjangoNode
from django_filters import FilterSet, CharFilter


class SceneNode(DjangoNode):

    class Meta:
        model = Scene


class SceneFilter(FilterSet):
    owner__name = CharFilter(lookup_type='exact', exclude=exclude)

    class Meta:
        model = Scene
        fields = ['owner__name']


class Query(ObjectType):

    scene = relay.NodeField(SceneNode)
    all_scenes = DjangoFilterConnectionField(SceneNode, filterset_class=SceneFilter, exclude=Boolean())

    def resolve_exclude(self, args, info):
        exclude = args.get('exclude')
        return exclude

    class Meta:
        abstract = True
My custom SceneFilter is used but I do not know how to pass the exclude arg to it. (I do not think that I am making a proper use of the resolver). Any help on that matter would be much appreciated!
Switching to graphene-django 1.0, I have been able to do what I wanted with the following query definition:
class Query(AbstractType):
    selected_scenes = DjangoFilterConnectionField(SceneNode, exclude=Boolean())

    def resolve_selected_scenes(self, args, context, info):
        owner__name = args.get('owner__name')
        exclude = args.get('exclude')
        if exclude:
            selected_scenes = Scene.objects.exclude(owner__name=owner__name)
        else:
            selected_scenes = Scene.objects.filter(owner__name=owner__name)
        return selected_scenes
BossGrand proposed another solution on GitHub.

How to prevent triples from getting mixed up while uploading to Dydra programmatically?

I am trying to upload some data to Dydra from a Sesame triplestore I have on my computer. While the download from Sesame works fine, the triples get mixed up (the s-p-o relationships change, as the object of one triple becomes the object of another). Can someone please explain why this is happening and how it can be resolved? The code is below:
#Querying the triplestore to retrieve all results
sesameSparqlEndpoint = 'http://my.ip.ad.here:8080/openrdf-sesame/repositories/rep_name'
sparql = SPARQLWrapper(sesameSparqlEndpoint)
queryStringDownload = 'SELECT * WHERE {?s ?p ?o}'
dataGraph = Graph()
sparql.setQuery(queryStringDownload)
sparql.method = 'GET'
sparql.setReturnFormat(JSON)
output = sparql.query().convert()
print output
for i in range(len(output['results']['bindings'])):
    #The encoding is necessary to parse non-English characters
    output['results']['bindings'][i]['s']['value'].encode('utf-8')
    try:
        subject_extract = output['results']['bindings'][i]['s']['value']
        if 'http' in subject_extract:
            subject = "<" + subject_extract + ">"
            subject_url = URIRef(subject)
            print subject_url
        predicate_extract = output['results']['bindings'][i]['p']['value']
        if 'http' in predicate_extract:
            predicate = "<" + predicate_extract + ">"
            predicate_url = URIRef(predicate)
            print predicate_url
        objec_extract = output['results']['bindings'][i]['o']['value']
        if 'http' in objec_extract:
            objec = "<" + objec_extract + ">"
            objec_url = URIRef(objec)
            print objec_url
        else:
            objec = objec_extract
            objec_wip = '"' + objec + '"'
            objec_url = URIRef(objec_wip)
        # Loading the data on a graph
        dataGraph.add((subject_url, predicate_url, objec_url))
    except UnicodeError as error:
        print error
#Print all statements in dataGraph
for stmt in dataGraph:
    pprint.pprint(stmt)
# Upload to Dydra
URL = 'http://dydra.com/login'
key = 'my_key'
with requests.Session() as s:
    resp = s.get(URL)
    soup = BeautifulSoup(resp.text, "html5lib")
    csrfToken = soup.find('meta', {'name': 'csrf-token'}).get('content')
    # print csrf_token
    payload = {
        'account[login]': key,
        'account[password]': '',
        'csrfmiddlewaretoken': csrfToken,
        'next': '/'
    }
    # print payload
    p = s.post(URL, data=payload, headers=dict(Referer=URL))
    # print p.text
    r = s.get('http://dydra.com/username/rep_name/sparql')
    # print r.text
dydraSparqlEndpoint = 'http://dydra.com/username/rep_name/sparql'
for stmt in dataGraph:
    queryStringUpload = 'INSERT DATA {%s %s %s}' % stmt
    sparql = SPARQLWrapper(dydraSparqlEndpoint)
    sparql.setCredentials(key, key)
    sparql.setQuery(queryStringUpload)
    sparql.method = 'POST'
    sparql.query()
A far simpler way to copy your data over (apart from using a CONSTRUCT query instead of a SELECT, like I mentioned in the comment) is simply to have Dydra itself directly access your Sesame endpoint, for example via a SERVICE-clause.
Execute the following on your Dydra database, and (after some time, depending on how large your Sesame database is), everything will be copied over:
INSERT { ?s ?p ?o }
WHERE {
SERVICE <http://my.ip.ad.here:8080/openrdf-sesame/repositories/rep_name>
{ ?s ?p ?o }
}
If the above doesn't work on Dydra, you can alternatively just directly access the RDF statements from your Sesame store by using the URI http://my.ip.ad.here:8080/openrdf-sesame/repositories/rep_name/statements. Assuming Dydra has an upload-feature where you can provide the URL of an RDF document, you can simply provide it the above URI and it should be able to load it.
The code above can work if the following changes are made:
Use a CONSTRUCT query instead of SELECT. Details here -> How to iterate over CONSTRUCT output from rdflib?
Use key as the input for both account[login] and account[password].
However, this is probably not the most efficient way. In particular, issuing an individual INSERT for every triple is not a good approach: Dydra did not record all statements this way (I got only about 30% of the triples inserted). In contrast, using the http://my.ip.ad.here:8080/openrdf-sesame/repositories/rep_name/statements method suggested by Jeen enabled me to port all the data successfully.
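For illustration, a minimal sketch of the CONSTRUCT-based download (assuming SPARQLWrapper and rdflib are installed, with the same placeholder endpoint as above) would be:

from SPARQLWrapper import SPARQLWrapper, XML

sparql = SPARQLWrapper('http://my.ip.ad.here:8080/openrdf-sesame/repositories/rep_name')
sparql.setQuery('CONSTRUCT {?s ?p ?o} WHERE {?s ?p ?o}')
sparql.setReturnFormat(XML)
dataGraph = sparql.query().convert()  # for CONSTRUCT queries, convert() yields an rdflib Graph
print(len(dataGraph))  # number of triples retrieved
dataGraph.serialize(destination='dump.nt', format='nt')  # a file that can then be imported

This avoids rebuilding each triple by hand from the SELECT bindings.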

Empty a DynamoDB table with boto

How can I optimally (in terms of financial cost) empty a DynamoDB table with boto, as we can do in SQL with a TRUNCATE statement?
boto.dynamodb2.table.delete() or boto.dynamodb2.layer1.DynamoDBConnection.delete_table() deletes the entire table, while boto.dynamodb2.table.delete_item() or boto.dynamodb2.table.BatchTable.delete_item() only deletes the specified items.
While I agree with Johnny Wu that dropping the table and recreating it is much more efficient, there may be cases, such as when many GSIs or trigger events are associated with a table, where you don't want to have to re-associate those. The script below should work: it recursively scans the table and uses the batch function to delete all items in the table. For massively large tables, though, this may not work, as it requires all items in the table to be loaded into your computer's memory.
import boto3

dynamo = boto3.resource('dynamodb')

def truncateTable(tableName):
    table = dynamo.Table(tableName)
    # get the table keys
    tableKeyNames = [key.get("AttributeName") for key in table.key_schema]
    """
    NOTE: there are reserved attributes for key names, please see https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/ReservedWords.html
    if a hash or range key is in the reserved word list, you will need to use the ExpressionAttributeNames parameter
    described at https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/dynamodb.html#DynamoDB.Table.scan
    """
    # Only retrieve the keys for each item in the table (minimize data transfer)
    ProjectionExpression = ", ".join(tableKeyNames)
    response = table.scan(ProjectionExpression=ProjectionExpression)
    data = response.get('Items')
    while 'LastEvaluatedKey' in response:
        response = table.scan(
            ProjectionExpression=ProjectionExpression,
            ExclusiveStartKey=response['LastEvaluatedKey'])
        data.extend(response['Items'])
    with table.batch_writer() as batch:
        for each in data:
            batch.delete_item(
                Key={key: each[key] for key in tableKeyNames}
            )

truncateTable("YOUR_TABLE_NAME")
As Johnny Wu mentioned, deleting a table and re-creating it is more efficient than deleting individual items. You should make sure your code doesn't try to create a new table before it is completely deleted.
import boto3

# client and resource handles used by the functions below
client = boto3.client('dynamodb')
dynamodb = boto3.resource('dynamodb')

def deleteTable(table_name):
    print('deleting table')
    return client.delete_table(TableName=table_name)

def createTable(table_name):
    waiter = client.get_waiter('table_not_exists')
    waiter.wait(TableName=table_name)
    print('creating table')
    table = dynamodb.create_table(
        TableName=table_name,
        KeySchema=[
            {
                'AttributeName': 'YOURATTRIBUTENAME',
                'KeyType': 'HASH'
            }
        ],
        AttributeDefinitions=[
            {
                'AttributeName': 'YOURATTRIBUTENAME',
                'AttributeType': 'S'
            }
        ],
        ProvisionedThroughput={
            'ReadCapacityUnits': 1,
            'WriteCapacityUnits': 1
        },
        StreamSpecification={
            'StreamEnabled': False
        }
    )

def emptyTable(table_name):
    deleteTable(table_name)
    createTable(table_name)
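Usage would then simply be (with the placeholder attribute name above replaced by your actual key attribute):
emptyTable('YOUR_TABLE_NAME')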
Deleting a table is much more efficient than deleting items one-by-one. If you are able to control your truncation points, then you can do something similar to rotating tables as suggested in the docs for time series data.
This builds on the answer given by Persistent Plants. If the table already exists, you can extract the table definitions and use that to recreate the table.
import boto3

dynamodb = boto3.resource('dynamodb', region_name='us-east-2')

def delete_table_ddb(table_name):
    table = dynamodb.Table(table_name)
    return table.delete()

def create_table_ddb(table_name, key_schema, attribute_definitions,
                     provisioned_throughput, stream_enabled, billing_mode):
    settings = dict(
        TableName=table_name,
        KeySchema=key_schema,
        AttributeDefinitions=attribute_definitions,
        StreamSpecification={'StreamEnabled': stream_enabled},
        BillingMode=billing_mode
    )
    if billing_mode == 'PROVISIONED':
        settings['ProvisionedThroughput'] = provisioned_throughput
    return dynamodb.create_table(**settings)

def truncate_table_ddb(table_name):
    table = dynamodb.Table(table_name)
    key_schema = table.key_schema
    attribute_definitions = table.attribute_definitions
    if table.billing_mode_summary:
        billing_mode = 'PAY_PER_REQUEST'
    else:
        billing_mode = 'PROVISIONED'
    if table.stream_specification:
        stream_enabled = True
    else:
        stream_enabled = False
    capacity = ['ReadCapacityUnits', 'WriteCapacityUnits']
    provisioned_throughput = {k: v for k, v in table.provisioned_throughput.items() if k in capacity}
    delete_table_ddb(table_name)
    table.wait_until_not_exists()
    return create_table_ddb(
        table_name,
        key_schema=key_schema,
        attribute_definitions=attribute_definitions,
        provisioned_throughput=provisioned_throughput,
        stream_enabled=stream_enabled,
        billing_mode=billing_mode
    )
Now call the function:
table_name = 'test_ddb'
truncate_table_ddb(table_name)
