transferring rdf to 4store - python

I have a script named rdf.py that generates RDF. What I want to do is load that output directly into 4store. I have stored the entire generated RDF in a variable and want to pass that variable straight to 4store. Is that possible?
The code of rdf.py is below; rdf_code holds the entire RDF that is generated.
import rdflib
from rdflib.events import Dispatcher, Event
from rdflib.graph import ConjunctiveGraph as Graph
from rdflib import plugin
from rdflib.store import Store, NO_STORE, VALID_STORE
from rdflib.namespace import Namespace
from rdflib.term import Literal
from rdflib.term import URIRef
from tempfile import mkdtemp
from gstudio.models import *
from objectapp.models import *
from reversion.models import Version
from optparse import make_option


def get_nodetype(name):
    """
    Returns the model the id belongs to.
    """
    try:
        """
        ALGO: get the object id, go to the Version model, return for the given id.
        """
        node = NID.objects.get(title=str(name))
        # Retrieve only the relevant tuple set for the versioned objects.
        vrs = Version.objects.filter(type=0, object_id=node.id)
        # The returned value is a list, so take the first element.
        vrs = vrs[0]
    except Exception:
        return "The item was not found."
    return vrs.object._meta.module_name


def rdf_description(name, notation='xml'):
    """
    Takes the title of a node and an RDF notation, and serializes the node as RDF.
    """
    valid_formats = ["xml", "n3", "ntriples", "trix"]
    default_graph_uri = "http://gstudio.gnowledge.org/rdfstore"
    configString = "/var/tmp/rdfstore"

    # Get the IOMemory store plugin.
    store = plugin.get('IOMemory', Store)('rdfstore')

    # Open the previously created store, or create it if it doesn't exist yet.
    graph = Graph(store="IOMemory",
                  identifier=URIRef(default_graph_uri))
    path = mkdtemp()
    rt = graph.open(path, create=False)
    if rt == NO_STORE:
        # There is no underlying store infrastructure, so create it.
        graph.open(path, create=True)
    else:
        assert rt == VALID_STORE, "The underlying store is corrupt"

    # Now we'll add some triples to the graph and commit the changes.
    # rdflib = Namespace('http://sbox.gnowledge.org/gstudio/')
    graph.bind("gstudio", "http://gnowledge.org/")
    exclusion_fields = ["id", "rght", "node_ptr_id", "image", "lft", "_state",
                        "_altnames_cache", "_tags_cache", "nid_ptr_id",
                        "_mptt_cached_fields"]

    node_type = get_nodetype(name)
    if node_type == 'gbobject':
        node = Gbobject.objects.get(title=name)
    elif node_type == 'objecttype':
        node = Objecttype.objects.get(title=name)
    elif node_type == 'metatype':
        node = Metatype.objects.get(title=name)
    elif node_type == 'attributetype':
        node = Attributetype.objects.get(title=name)
    elif node_type == 'relationtype':
        node = Relationtype.objects.get(title=name)
    elif node_type == 'attribute':
        node = Attribute.objects.get(title=name)
    elif node_type == 'complement':
        node = Complement.objects.get(title=name)
    elif node_type == 'union':
        node = Union.objects.get(title=name)
    elif node_type == 'intersection':
        node = Intersection.objects.get(title=name)
    elif node_type == 'expression':
        node = Expression.objects.get(title=name)
    elif node_type == 'processtype':
        node = Processtype.objects.get(title=name)
    elif node_type == 'systemtype':
        node = Systemtype.objects.get(title=name)

    node_url = node.get_absolute_url()
    site_add = node.sites.all()
    a = site_add[0]
    host_name = a.name
    # host_name = name
    link = 'http://'
    # Concatenating the above variables gives the full URL address of the node.
    url_add = link + host_name + node_url
    rdflib = Namespace(url_add)
    # node = Objecttype.objects.get(title=name)

    node_dict = node.__dict__
    subject = str(node_dict['id'])
    for key in node_dict:
        if key not in exclusion_fields:
            predicate = str(key)
            pobject = str(node_dict[predicate])
            graph.add((rdflib[subject], rdflib[predicate], Literal(pobject)))

    rdf_code = graph.serialize(format=notation)

    # Print out all the triples in the graph.
    for subject, predicate, object in graph:
        print subject, predicate, object

    graph.commit()
    print rdf_code
    graph.close()
Can I directly pass rdf_code to 4store? If yes, then how?

The simplest way to do this is to serialize that graph as N-Triples and send it to http://yourhost:port/data/GRAPH_URI. If you do an HTTP POST, the triples will be appended to the existing graph identified by GRAPH_URI. If you do an HTTP PUT, the current graph will be replaced. If the graph does not exist, it will be created whether you POST or PUT.
Taking this function as example:
import urllib
import urllib2

def assert4s(data, epr, graph, contenttype, flush=False):
    try:
        params = urllib.urlencode({'graph': graph,
                                   'data': data,
                                   'mime-type': contenttype})
        opener = urllib2.build_opener(urllib2.HTTPHandler)
        request = urllib2.Request(epr, params)
        request.get_method = lambda: ('PUT' if flush else 'POST')
        url = opener.open(request)
        return url.read()
    except Exception, e:
        raise e
If you had the following data:
triples = """<a> <b> <c> .
<d> <e> <f> .
"""
You can do the following call:
assert4s(triples,
         "http://yourhost:port/data/",
         "http://some.org/graph/id",
         "application/x-turtle")
Edit
My previous answer assumed you were using the 4s-httpd server. You can start the SPARQL server in 4store with the following command: 4s-httpd -p PORT kb_name. Once it is running, you can use the following services:
http://localhost:port/sparql/ to submit queries.
http://localhost:port/data/ to PUT or POST data files.
http://localhost:port/update/ to submit SPARQL Update queries.
The 4store SPARQLServer documentation is quite complete.
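As a rough illustration of the /sparql/ endpoint (a sketch; the port below is whatever you passed to 4s-httpd, and you should check the 4store documentation for the result formats your version supports), a query could be submitted like this:
import urllib
import urllib2

# Hypothetical endpoint; replace 8000 with the port passed to 4s-httpd.
SPARQL_URL = "http://localhost:8000/sparql/"

query = "SELECT * WHERE { ?s ?p ?o } LIMIT 10"
params = urllib.urlencode({'query': query})
print urllib2.urlopen(SPARQL_URL, params).read()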

Related

Can I Limit Value used by python in Uri of an Api Gateway

I'm asking a new question because my original was closed by an admin who said a similar question had the answer. I checked it and it didn't: that answer was about user input in Python itself, whereas I am using a front end in the form of an API URI, which I have included in this question.
In my code below I call an API using 'Method' and 'Value' parameters. For example, on the URL they use to call the API they enter Method2=Min and Value2=2. This works fine and allows the user to input whatever they want into Value2, so they can change the settings of an Auto Scaling group.
If the user makes a mistake and enters a Value of 100 instead of 10 in the URI, for example Value3=100, I would not want the Auto Scaling group's Max size to change to 100. Is there a way of editing my script so I can stop that Max size from changing to 100? Something like a maximum allowed value in the script itself?
To invoke the API I use a PowerShell command like this:
invoke-webrequest -Uri 'https://name-of-api.amazonaws.com/?TagKey=tag-key-name&TagValue=tag-value-name&Method1=Capacity&Value1=1&Method2=Min&Value2=1&Method3=Max&Value3=4' -Headers @{"X-Api-Key"="name-of-api-key"}
As you can see, the user has the power to change 'Method3=Max' to have a 'Value3' of 100 if they want.
Thanks
import boto3
import botocore
import os
import json
import logging

# Set up logger.
logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)

# Set up clients.
client = boto3.client('autoscaling')


def handler(event, context):
    # Step one: get all AutoScaling Groups.
    response = client.describe_auto_scaling_groups(
        MaxRecords=100
    )

    # Make an empty list for ASG info storage.
    allAutoScaling = []  # empty list of no items.

    # Get the initial results and post them to CloudWatch logs.
    allAutoScaling.append(response['AutoScalingGroups'])
    logger.info(allAutoScaling)

    # If 'Marker' is present in the response, we have more to get.
    while 'Marker' in response:
        old_marker = response['Marker']
        response = client.describe_auto_scaling_groups(
            MaxRecords=100,
            Marker=old_marker
        )
        allAutoScaling.append(response['AutoScalingGroups'])
        # Cycles back up to repeat the pagination.

    # Now to find the tags specified in the Api Uri.
    for autoscaling in allAutoScaling:
        for key in autoscaling:
            for tag in key['Tags']:
                if tag['Key'] == event['TagKey'] and tag['Value'] == event['TagValue']:
                    if event['Method2'] == "Min":
                        if event['Value2']:
                            Value2 = int(event['Value2'])
                            response = client.update_auto_scaling_group(
                                AutoScalingGroupName=key['AutoScalingGroupName'],
                                MinSize=Value2
                            )
                        else:
                            print('Min already at required level')
                    if event['Method3'] == "Max":
                        if event['Value3']:
                            Value3 = int(event['Value3'])
                            response = client.update_auto_scaling_group(
                                AutoScalingGroupName=key['AutoScalingGroupName'],
                                MaxSize=Value3
                            )
                        else:
                            print('Max already at required level')
                    if event['Method1'] == "Capacity":
                        if event['Value1']:
                            Value1 = int(event['Value1'])
                            response = client.set_desired_capacity(
                                AutoScalingGroupName=key['AutoScalingGroupName'],
                                DesiredCapacity=Value1
                            )
                        else:
                            print('Capacity already at required level')
Something along these lines:
try:
    value = int(event['Value3'])
except ValueError:
    print(f"{event['Value3']} is not a valid integer")
else:
    if 1 <= value <= 3:
        pass  # do something, process the request
    else:
        print(f'Value must be between 1 and 3, got {value} instead')
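Applied to the handler above, a minimal sketch of that idea could look like this. The LIMITS dictionary and the validated_int helper are illustrative names, not part of the original code; adjust the ranges to whatever you actually allow.
# Hypothetical per-parameter limits for the Capacity, Min and Max methods.
LIMITS = {
    'Value1': (1, 10),  # desired capacity
    'Value2': (1, 5),   # minimum size
    'Value3': (1, 10),  # maximum size
}

def validated_int(event, key):
    """Return event[key] as an int if it is inside the allowed range, else None."""
    low, high = LIMITS[key]
    try:
        value = int(event[key])
    except (KeyError, ValueError):
        print(f"{key} is missing or not a valid integer")
        return None
    if low <= value <= high:
        return value
    print(f"{key} must be between {low} and {high}, got {value} instead")
    return None
Inside handler(), each block would then call, for example, value3 = validated_int(event, 'Value3') and only invoke update_auto_scaling_group when the result is not None.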

Running Google Cloud DocumentAI sample code on Python returned the error 503

I am trying the example from the Google repo:
https://github.com/googleapis/python-documentai/blob/HEAD/samples/snippets/quickstart_sample.py
I have an error:
metadata=[('x-goog-request-params', 'name=projects/my_proj_id/locations/us/processors/my_processor_id'), ('x-goog-api-client', 'gl-python/3.8.10 grpc/1.38.1 gax/1.30.0 gapic/1.0.0')]), last exception: 503 DNS resolution failed for service: https://us-documentai.googleapis.com/v1/
My full code:
from google.cloud import documentai_v1 as documentai
import os

# TODO(developer): Uncomment these variables before running the sample.
project_id = '123456789'
location = 'us'  # Format is 'us' or 'eu'
processor_id = '1a23345gh823892'  # Create processor in Cloud Console
file_path = 'document.jpg'

os.environ['GRPC_DNS_RESOLVER'] = 'native'


def quickstart(project_id: str, location: str, processor_id: str, file_path: str):
    # You must set the api_endpoint if you use a location other than 'us', e.g.:
    opts = {}
    if location == "eu":
        opts = {"api_endpoint": "eu-documentai.googleapis.com"}

    client = documentai.DocumentProcessorServiceClient(client_options=opts)

    # The full resource name of the processor, e.g.:
    # projects/project-id/locations/location/processor/processor-id
    # You must create new processors in the Cloud Console first
    name = f"projects/{project_id}/locations/{location}/processors/{processor_id}:process"

    # Read the file into memory
    with open(file_path, "rb") as image:
        image_content = image.read()

    document = {"content": image_content, "mime_type": "image/jpeg"}

    # Configure the process request
    request = {"name": name, "raw_document": document}

    result = client.process_document(request=request)
    document = result.document
    document_pages = document.pages

    # For a full list of Document object attributes, please reference this page:
    # https://googleapis.dev/python/documentai/latest/_modules/google/cloud/documentai_v1beta3/types/document.html#Document

    # Read the text recognition output from the processor
    print("The document contains the following paragraphs:")
    for page in document_pages:
        paragraphs = page.paragraphs
        for paragraph in paragraphs:
            print(paragraph)
            paragraph_text = get_text(paragraph.layout, document)
            print(f"Paragraph text: {paragraph_text}")


def get_text(doc_element: dict, document: dict):
    """
    Document AI identifies form fields by their offsets
    in document text. This function converts offsets
    to text snippets.
    """
    response = ""
    # If a text segment spans several lines, it will
    # be stored in different text segments.
    for segment in doc_element.text_anchor.text_segments:
        start_index = (
            int(segment.start_index)
            if segment in doc_element.text_anchor.text_segments
            else 0
        )
        end_index = int(segment.end_index)
        response += document.text[start_index:end_index]
    return response


def main():
    quickstart(project_id=project_id, location=location,
               processor_id=processor_id, file_path=file_path)


if __name__ == '__main__':
    main()
FYI, the Google Cloud website states that the endpoint is:
https://us-documentai.googleapis.com/v1/projects/123456789/locations/us/processors/1a23345gh823892:process
I can use the web interface to run Document AI, so the processor itself works. I just have the problem with the Python code.
Any suggestion is appreciated.
I would suspect the GRPC_DNS_RESOLVER environment variable to be the root cause. Did you try with the corresponding line commented out? Why was it added in your code?
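In other words, the quickest test is to drop the resolver override before the client is created. A minimal sketch, assuming nothing else in the sample changes:
import os

# Either never set GRPC_DNS_RESOLVER at all, or remove it explicitly
# before constructing DocumentProcessorServiceClient.
os.environ.pop('GRPC_DNS_RESOLVER', None)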

Reset index name in elasticsearch dsl

I'm trying to create an ETL that extracts from Mongo, processes the data and loads it into Elastic. I will do a daily load, so I thought of naming my index with the current date. This will help me with a later processing step I need to do on this first index.
I used elasticsearch dsl guide: https://elasticsearch-dsl.readthedocs.io/en/latest/persistence.html
The problem I have comes from my limited experience working with classes: I don't know how to reset the index name from the class.
Here is my code for the class (custom_indices.py):
from elasticsearch_dsl import Document, Date, Integer, Keyword, Text
from elasticsearch_dsl.connections import connections
from elasticsearch_dsl import Search
import datetime


class News(Document):
    title = Text(analyzer='standard', fields={'raw': Keyword()})
    manual_tagging = Keyword()

    class Index:
        name = 'processed_news_' + datetime.datetime.now().strftime("%Y%m%d")

    def save(self, **kwargs):
        return super(News, self).save(**kwargs)

    def is_published(self):
        return datetime.datetime.now() >= self.processed
And this is the part of the code where I create instances of that class:
from custom_indices import News
import elasticsearch
import elasticsearch_dsl
from elasticsearch_dsl.connections import connections
import pandas as pd
import datetime

connections.create_connection(hosts=['localhost'])
News.init()
for index, doc in df.iterrows():
    new_insert = News(meta={'id': doc.url_hashed},
                      title=doc.title,
                      manual_tagging=doc.customTags,
                      )
    new_insert.save()
Every time I call the "News" class I would expect to get a new name. However, the name doesn't change even if I reload the class (from custom_indices import News). I know this is only a problem when testing, but I'd like to know how to force that "reset". Actually, I originally wanted to change the name outside the class with this line right before the loop:
News.Index.name = "NEW_NAME"
However, that didn't work. I was still seeing the name defined on the class.
Could anyone please assist?
Many thanks!
PS: this must be just an object oriented programming issue. Apologies for my ignorance on the subject.
Maybe you could take advantage of the fact that Document.init() accepts an index keyword argument. If you want the index name to get set automatically, you could implement init() in the News class and call super().init(...) in your implementation.
A simplified example (python 3.x):
from elasticsearch_dsl import Document
from elasticsearch_dsl.connections import connections
import datetime


class News(Document):
    @classmethod
    def init(cls, index=None, using=None):
        index_name = index or 'processed_news_' + datetime.datetime.now().strftime("%Y%m%d")
        return super().init(index=index_name, using=using)
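With that override in place, a daily run only needs to call init() once before indexing, and a name can still be forced explicitly. A short usage sketch (the connection host is the one from the question):
import datetime
from elasticsearch_dsl.connections import connections
from custom_indices import News

connections.create_connection(hosts=['localhost'])

# Picks up today's index name via the overridden init().
News.init()

# Or force a specific index name explicitly:
News.init(index='processed_news_' + datetime.datetime.now().strftime("%Y%m%d"))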
You can override the index when you call save().
new_insert.save(index='processed_news_' + datetime.datetime.now().strftime("%Y%m%d"))
A fuller example follows.
# coding: utf-8
import datetime

from elasticsearch_dsl import Keyword, Text, \
    Index, Document, Date
from elasticsearch_dsl.connections import connections

HOST = "localhost:9200"

index_names = [
    "foo-log-",
    "bar-log-",
]

default_settings = {"number_of_shards": 4, "number_of_replicas": 1}
index_settings = {
    "foo-log-": {
        "number_of_shards": 40,
        "number_of_replicas": 1
    }
}


class LogDoc(Document):
    level = Keyword(ignore_above=256)
    date = Date(format="yyyy-MM-dd'T'HH:mm:ss.SSS")
    hostname = Text(fields={'fields': Keyword(ignore_above=256)})
    message = Text()
    createTime = Date(format="yyyy-MM-dd'T'HH:mm:ss.SSS")


def auto_create_index():
    '''Automatically create the ES indices.'''
    connections.create_connection(hosts=[HOST])
    for day in range(3):
        dt = datetime.datetime.now() + datetime.timedelta(days=day)
        for index in index_names:
            name = index + dt.strftime("%Y-%m-%d")
            settings = index_settings.get(index, default_settings)
            idx = Index(name=name)
            idx.document(LogDoc)
            idx.settings(**settings)
            try:
                idx.create()
            except Exception as e:
                print(e)
                continue
            print("create index %s" % name)


if __name__ == '__main__':
    auto_create_index()

script to serve from url, for requests matching regular expression

I am a complete n00b in Python and am trying to figure out a stub for mitmproxy.
I have tried the documentation, but it assumes you know Python, so I am at a standstill.
I've been working with a script:
original_url = 'http://production.domain.com/1/2/3'
new_content_path = '/home/andrepadez/proj/main.js'
body = open(new_content_path, 'r').read()

def response(context, flow):
    url = flow.request.get_url()
    if url == original_url:
        flow.response.content = body
As you can predict, the proxy takes every request to 'http://production.domain.com/1/2/3' and serves the content of my file.
I need this to be more dynamic:
for every request to 'http://production.domain.com/*', I need to serve the content of the corresponding URL, for example:
http://production.domain.com/1/4/3 -> http://develop.domain.com/1/4/3
I know I have to use a regular expression so I can capture and map the path correctly, but I don't know how to serve the contents of the develop URL as "flow.response.content".
Any help will be welcome.
You would have to do something like this:
import re

# In order not to re-read the original file every time, we maintain
# a cache of already-read bodies.
bodies = {}

def response(context, flow):
    # Intercept all URLs
    url = flow.request.get_url()
    # Check if this URL is one of "ours" (check out Python regexps)
    m = re.search(r'REGEXP_FOR_ORIGINAL_URL/(\d+)/(\d+)/(\d+)', url)
    if m is not None:
        # It is, and m will contain this information.
        # The three numbers are in m.group(1), (2), (3).
        key = "%s.%s.%s" % (m.group(1), m.group(2), m.group(3))
        try:
            body = bodies[key]
        except KeyError:
            # We do not yet have this body; read it from wherever it lives,
            # for example a local file named after the key.
            body = open("%s.txt" % key, 'r').read()
            bodies[key] = body
        flow.response.content = body
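If, as in the original question, the replacement body should come from the corresponding URL on the develop host rather than from a local file, a minimal sketch along the same lines might look like this. The hostnames are the ones from the question, urllib2 matches the Python 2 style used above, and error handling for failed fetches is left out.
import re
import urllib2

# Cache of already-fetched bodies, keyed by path.
bodies = {}

def response(context, flow):
    url = flow.request.get_url()
    # Capture the path after the production host.
    m = re.search(r'http://production\.domain\.com(/.*)', url)
    if m is not None:
        path = m.group(1)
        try:
            body = bodies[path]
        except KeyError:
            # Fetch the same path from the develop host and cache it.
            body = urllib2.urlopen('http://develop.domain.com' + path).read()
            bodies[path] = body
        flow.response.content = body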

Fetching language detection from Google api

I have a CSV with keywords in one column and the number of impressions in a second column.
I'd like to provide the keywords in a URL (while looping) and have the Google language API return which language each keyword is in.
I have it working manually. If I enter (with the correct API key):
http://ajax.googleapis.com/ajax/services/language/detect?v=1.0&key=myapikey&q=merde
I get:
{"responseData": {"language":"fr","isReliable":false,"confidence":6.213709E-4}, "responseDetails": null, "responseStatus": 200}
which is correct, 'merde' is French.
So far I have this code, but I keep getting server-unreachable errors:
import time
import csv
from operator import itemgetter
import sys
import fileinput
import urllib2
import json

E_OPERATION_ERROR = 1
E_INVALID_PARAMS = 2

# not working
def parse_result(result):
    """Parse a JSONP result string and return a list of terms"""
    # Deserialize JSON to Python objects
    result_object = json.loads(result)
    # Get the rows in the table, then get the second column's value
    # for each row
    return row in result_object

# not working
def retrieve_terms(seedterm):
    print(seedterm)
    """Retrieves and parses data and returns a list of terms"""
    url_template = 'http://ajax.googleapis.com/ajax/services/language/detect?v=1.0&key=myapikey&q=%(seed)s'
    url = url_template % {"seed": seedterm}
    try:
        with urllib2.urlopen(url) as data:
            data = perform_request(seedterm)
            result = data.read()
    except:
        sys.stderr.write('%s\n' % 'Could not request data from server')
        exit(E_OPERATION_ERROR)
    # terms = parse_result(result)
    # print terms
    print result

def main(argv):
    filename = argv[1]
    csvfile = open(filename, 'r')
    csvreader = csv.DictReader(csvfile)
    rows = []
    for row in csvreader:
        rows.append(row)
    sortedrows = sorted(rows, key=itemgetter('impressions'), reverse=True)
    keys = sortedrows[0].keys()
    for item in sortedrows:
        retrieve_terms(item['keywords'])
    try:
        outputfile = open('Output_%s.csv' % (filename), 'w')
    except IOError:
        print("The file is active in another program - close it first!")
        sys.exit()
    dict_writer = csv.DictWriter(outputfile, keys, lineterminator='\n')
    dict_writer.writer.writerow(keys)
    dict_writer.writerows(sortedrows)
    outputfile.close()
    print("File is Done!! Check your folder")

if __name__ == '__main__':
    start_time = time.clock()
    main(sys.argv)
    print("\n")
    print time.clock() - start_time, "seconds for script time"
Any idea how to finish the code so that it will work? Thank you!
Try adding a referer header and the userip parameter, as described in the docs:
An area to pay special attention to relates to correctly identifying yourself in your requests. Applications MUST always include a valid and accurate http referer header in their requests. In addition, we ask, but do not require, that each request contains a valid API Key. By providing a key, your application provides us with a secondary identification mechanism that is useful should we need to contact you in order to correct any problems. Read more about the usefulness of having an API key.
Developers are also encouraged to make use of the userip parameter (see below) to supply the IP address of the end-user on whose behalf you are making the API request. Doing so will help distinguish this legitimate server-side traffic from traffic which doesn't come from an end-user.
Here's an example based on the answer to the question "access to google with python":
#!/usr/bin/python
# -*- coding: utf-8 -*-
import json
import urllib, urllib2
from pprint import pprint

api_key, userip = None, None
query = {'q': 'матрёшка'}
referrer = "https://stackoverflow.com/q/4309599/4279"

if userip:
    query.update(userip=userip)
if api_key:
    query.update(key=api_key)

url = 'http://ajax.googleapis.com/ajax/services/language/detect?v=1.0&%s' % (
    urllib.urlencode(query))
request = urllib2.Request(url, headers=dict(Referer=referrer))
json_data = json.load(urllib2.urlopen(request))
pprint(json_data['responseData'])
Output
{u'confidence': 0.070496580000000003, u'isReliable': False, u'language': u'ru'}
Another issue might be that seedterm is not properly quoted:
if isinstance(seedterm, unicode):
    value = seedterm
else:  # bytes
    value = seedterm.decode(put_encoding_here)
url = 'http://...q=%s' % urllib.quote_plus(value.encode('utf-8'))
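Putting the two suggestions together, a sketch of a reworked retrieve_terms might look like the following (Python 2, matching the question's code; the referer value and the myapikey placeholder are assumptions you would replace with your own):
import json
import urllib
import urllib2

def retrieve_terms(seedterm, api_key='myapikey',
                   referrer='http://example.com/my-app'):
    """Return the parsed language-detection response for one keyword."""
    if not isinstance(seedterm, unicode):
        seedterm = seedterm.decode('utf-8')  # assume the CSV is UTF-8
    query = urllib.urlencode({'v': '1.0',
                              'key': api_key,
                              'q': seedterm.encode('utf-8')})
    url = 'http://ajax.googleapis.com/ajax/services/language/detect?%s' % query
    request = urllib2.Request(url, headers={'Referer': referrer})
    return json.load(urllib2.urlopen(request))['responseData']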
