Running Google Cloud Document AI sample code in Python returns a 503 error

I am trying the example from the Google repo:
https://github.com/googleapis/python-documentai/blob/HEAD/samples/snippets/quickstart_sample.py
I get this error:
metadata=[('x-goog-request-params', 'name=projects/my_proj_id/locations/us/processors/my_processor_id'), ('x-goog-api-client', 'gl-python/3.8.10 grpc/1.38.1 gax/1.30.0 gapic/1.0.0')]), last exception: 503 DNS resolution failed for service: https://us-documentai.googleapis.com/v1/
My full code:
from google.cloud import documentai_v1 as documentai
import os

# TODO(developer): Uncomment these variables before running the sample.
project_id = '123456789'
location = 'us'  # Format is 'us' or 'eu'
processor_id = '1a23345gh823892'  # Create processor in Cloud Console
file_path = 'document.jpg'

os.environ['GRPC_DNS_RESOLVER'] = 'native'


def quickstart(project_id: str, location: str, processor_id: str, file_path: str):
    # You must set the api_endpoint if you use a location other than 'us', e.g.:
    opts = {}
    if location == "eu":
        opts = {"api_endpoint": "eu-documentai.googleapis.com"}

    client = documentai.DocumentProcessorServiceClient(client_options=opts)

    # The full resource name of the processor, e.g.:
    # projects/project-id/locations/location/processor/processor-id
    # You must create new processors in the Cloud Console first
    name = f"projects/{project_id}/locations/{location}/processors/{processor_id}:process"

    # Read the file into memory
    with open(file_path, "rb") as image:
        image_content = image.read()

    document = {"content": image_content, "mime_type": "image/jpeg"}

    # Configure the process request
    request = {"name": name, "raw_document": document}

    result = client.process_document(request=request)
    document = result.document
    document_pages = document.pages

    # For a full list of Document object attributes, please reference this page:
    # https://googleapis.dev/python/documentai/latest/_modules/google/cloud/documentai_v1beta3/types/document.html#Document

    # Read the text recognition output from the processor
    print("The document contains the following paragraphs:")
    for page in document_pages:
        paragraphs = page.paragraphs
        for paragraph in paragraphs:
            print(paragraph)
            paragraph_text = get_text(paragraph.layout, document)
            print(f"Paragraph text: {paragraph_text}")


def get_text(doc_element: dict, document: dict):
    """
    Document AI identifies form fields by their offsets
    in document text. This function converts offsets
    to text snippets.
    """
    response = ""
    # If a text segment spans several lines, it will
    # be stored in different text segments.
    for segment in doc_element.text_anchor.text_segments:
        start_index = (
            int(segment.start_index)
            if segment in doc_element.text_anchor.text_segments
            else 0
        )
        end_index = int(segment.end_index)
        response += document.text[start_index:end_index]
    return response


def main():
    quickstart(project_id=project_id, location=location, processor_id=processor_id, file_path=file_path)


if __name__ == '__main__':
    main()
FYI, the Google Cloud website states that the endpoint is:
https://us-documentai.googleapis.com/v1/projects/123456789/locations/us/processors/1a23345gh823892:process
I can run Document AI from the web interface, so the processor itself is working; I only have this problem with the Python code.
Any suggestion is appreciated.

I would suspect the GRPC_DNS_RESOLVER environment variable to be the root cause. Did you try with that line commented out? Why was it added to your code?
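If it helps, here is a minimal sketch of the relevant part of the quickstart without forcing the native resolver (an assumption on my part, using the default regional endpoint for location='us'; adjust if you keep the eu branch):

from google.cloud import documentai_v1 as documentai

# Try running with the resolver override removed or commented out:
# os.environ['GRPC_DNS_RESOLVER'] = 'native'

# If you want to pin the endpoint explicitly, pass it via client_options instead:
opts = {"api_endpoint": "us-documentai.googleapis.com"}
client = documentai.DocumentProcessorServiceClient(client_options=opts)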


Python wildcard search

I have a Python Lambda function that I inherited which searches and reports on installed packages on EC2 instances. It pulls this information from SSM Inventory, where the results are output to an S3 bucket. Until now, all of the installed packages have had specific names. Now we need to report on Palo Alto Cortex XDR. The issue I'm facing is that this product includes the version number in the name and we have different versions installed. If I use the exact name (e.g. Cortex XDR 7.8.1.11343) I get reporting on that particular version but not the others. I want to use a wildcard instead. I import the regex module (import re) on line 7 and then change line 71 to xdr=line['Cortex*'], but it gives me the following error. I'm a bit new to Python and coding, so any explanation of what I'm doing wrong would be helpful.
File "/var/task/SoeSoftwareCompliance/RequiredSoftwareEmail.py", line 72, in build_html
xdr=line['Cortex*'])
import configparser
import logging
import csv
import json
from jinja2 import Template
import boto3
import re

# config
config = configparser.ConfigParser()
config.read("config.ini")

# logging
logger = logging.getLogger()
logger.setLevel(logging.INFO)


# #TODO
# refactor common_csv_header so that we use one with variable
# so that we write all content to one template file.
def build_html(account=None,
               ses_email_address=None,
               recipient_email=None):
    """
    :param recipient_email:
    :param ses_email_address:
    :param account:
    """
    account_id = account["id"]
    account_alias = account["alias"]

    linux_ec2s = []
    windows_ec2s = []
    ec2s_not_in_ssm = []
    excluded_ec2s = []

    # linux ec2s html
    with open(f"/tmp/{account_id}_linux_ec2s_required_software_report.csv", "r") as fp:
        lines = csv.DictReader(fp)
        for line in lines:
            if line["platform-type"] == "Linux":
                item = dict(id=line['instance-id'],
                            name=line['instance-name'],
                            ip=line['ip-address'],
                            ssm=line['amazon-ssm-agent'],
                            cw=line['amazon-cloudwatch-agent'],
                            ch=line['cloudhealth-agent'])
                # skip compliant linux ec2s where all values are found
                compliance_status = not all(item.values())
                if compliance_status:
                    linux_ec2s.append(item)

    # windows ec2s html
    with open(f"/tmp/{account_id}_windows_ec2s_required_software_report.csv", "r") as fp:
        lines = csv.DictReader(fp)
        for line in lines:
            if line["platform-type"] == "Windows":
                item = dict(id=line['instance-id'],
                            name=line['instance-name'],
                            ip=line['ip-address'],
                            ssm=line['Amazon SSM Agent'],
                            cw=line['Amazon CloudWatch Agent'],
                            ch=line['CloudHealth Agent'],
                            mav=line['McAfee VirusScan Enterprise'],
                            trx=line['Trellix Agent'],
                            xdr=line['Cortex*'])
                # skip compliant windows ec2s where all values are found
                compliance_status = not all(item.values())
                if compliance_status:
                    windows_ec2s.append(item)

    # ec2s not found in ssm
    with open(f"/tmp/{account_id}_ec2s_not_in_ssm.csv", "r") as fp:
        lines = csv.DictReader(fp)
        for line in lines:
            item = dict(name=line['instance-name'],
                        id=line['instance-id'],
                        ip=line['ip-address'],
                        pg=line['patch-group'])
            ec2s_not_in_ssm.append(item)

    # display or hide excluded ec2s from report
    display_excluded_ec2s_in_report = json.loads(config.get("settings", "display_excluded_ec2s_in_report"))
    if display_excluded_ec2s_in_report == "true":
        with open(f"/tmp/{account_id}_excluded_from_compliance.csv", "r") as fp:
            lines = csv.DictReader(fp)
            for line in lines:
                item = dict(id=line['instance-id'],
                            name=line['instance-name'],
                            pg=line['patch-group'])
                excluded_ec2s.append(item)

    # pass data to html template
    with open('templates/email.html') as file:
        template = Template(file.read())

    # pass parameters to template renderer
    html = template.render(
        linux_ec2s=linux_ec2s,
        windows_ec2s=windows_ec2s,
        ec2s_not_in_ssm=ec2s_not_in_ssm,
        excluded_ec2s=excluded_ec2s,
        account_id=account_id,
        account_alias=account_alias)

    # consolidated html with multiple tables
    tables_html_code = html

    client = boto3.client('ses')
    client.send_email(
        Destination={
            'ToAddresses': [recipient_email],
        },
        Message={
            'Body': {
                'Html':
                    {'Data': tables_html_code}
            },
            'Subject': {
                'Charset': 'UTF-8',
                'Data': f'SOE | Software Compliance | {account_alias}',
            },
        },
        Source=ses_email_address,
    )
    print(tables_html_code)
If I understand your problem correctly, you are getting a KeyError exception because Python does not support wildcards out of the box. A csv.DictReader creates a standard Python dictionary for each row in the CSV, and Python's dictionary is just an associative array without pattern matching.
You can implement this with a regex, though. If you have a dictionary line and you don't know the full name of the key you are looking for, you can solve it with the re.search function.
line = {'Cortex XDR 7.8.1.11343': 'Some value you are looking for'}
val = next(v for k, v in line.items() if re.search('Cortex.+', k))
print(val) # 'Some value you are looking for'
Be aware that this assumes the line dictionary contains at least one key matching the 'Cortex.+' pattern, and it returns the first match; you would have to refactor it a bit to change that behaviour.
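If you want to avoid a StopIteration when no key matches, a small helper (hypothetical, not from the original code) with a default value could slot straight into the DictReader loop:

import re

def match_key(row, pattern, default=None):
    # Return the value for the first key matching `pattern`, or `default` if none match.
    return next((v for k, v in row.items() if re.search(pattern, k)), default)

# inside the Windows loop of the Lambda:
# xdr=match_key(line, r'Cortex.+')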
1. import os - missing from the code.
2. def build_html(account=None, ...) - when account is passed as None, the error below is thrown at account["id"] and account["alias"].
Example:
Traceback (most recent call last):
File "C:\Users\pro\Documents\project\pywilds.py", line 134, in <module>
build_html(account=None)
File "C:\Users\pro\Documents\project\pywilds.py", line 33, in build_html
account_id = account["id"]
TypeError: 'NoneType' object is not subscriptable
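A small guard (just a sketch, not part of the original function) makes that failure mode more obvious:

def build_html(account=None, ses_email_address=None, recipient_email=None):
    # Fail fast with a clear message instead of a NoneType subscription error.
    if account is None:
        raise ValueError("account must be a dict with 'id' and 'alias' keys")
    account_id = account["id"]
    account_alias = account["alias"]
    # ... rest of the function unchanged ...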
I hope it helps.

amp azure playback speed python

I'm trying to change the playback speed in the Azure Media Player (AMP).
The following is the URL generated from the Azure APIs: https://ampdemo.azureedge.net/?url=https://testingmedia-usea.streaming.media.azure.net/bbd51d47-cc1a-4515-bac8-4053040f8c58/ignite.ism/manifest(format=mpd-time-cmaf,filter=filter1)&heuristicprofile=lowlatency
If you check that link, there is no playback speed control.
I saw the link below but don't know where to apply it in my Python code:
https://amp.azure.net/libs/amp/latest/docs/index.html#amp.player.options.playbackspeed
Below is my code:
from dotenv import load_dotenv
from azure.identity import DefaultAzureCredential
from azure.mgmt.media import AzureMediaServices
from azure.storage.blob import BlobServiceClient
from azure.mgmt.media.models import (
    Asset,
    Transform,
    TransformOutput,
    BuiltInStandardEncoderPreset,
    Job,
    JobInputAsset,
    JobOutputAsset,
    OnErrorType,
    Priority,
    StreamingLocator,
    AssetFilter,
    PresentationTimeRange,
)
import os
import random

# Timer for checking job progress
import time
import requests

# Get environment variables
load_dotenv()

default_credential = DefaultAzureCredential(exclude_shared_token_cache_credential=True)

# Get the environment variables SUBSCRIPTIONID, RESOURCEGROUP and ACCOUNTNAME
subscription_id = os.getenv('SUBSCRIPTIONID')
resource_group = os.getenv('RESOURCEGROUP')
account_name = os.getenv('ACCOUNTNAME')

# The file you want to upload. For this example, put the file in the same folder as this script.
# The file ignite.mp4 has been provided for you.
source_file = "https://testingmedia.blob.core.windows.net/data/ignite.mp4"
#url = requests.get(source_file)

# This is a random string that will be added to the naming of things so that you don't have to keep doing this during testing
uniqueness = "streamAssetFilters-" + str(random.randint(0,9999))

# Change this to your specific streaming endpoint name if not using "default"
streaming_endpoint_name = "default"

# Set the attributes of the input Asset using the random number
in_asset_name = 'inputassetName' + uniqueness
in_alternate_id = 'inputALTid' + uniqueness
in_description = 'inputdescription' + uniqueness

# Create an Asset object
# The asset_id will be used for the container parameter for the storage SDK after the asset is created by the AMS client.
in_asset = Asset(alternate_id=in_alternate_id, description=in_description)

# Set the attributes of the output Asset using the random number
out_asset_name = 'outputassetName' + uniqueness
out_alternate_id = 'outputALTid' + uniqueness
out_description = 'outputdescription' + uniqueness

# Create an output asset object
out_asset = Asset(alternate_id=out_alternate_id, description=out_description)

# The AMS Client
print("Creating AMS Client")
client = AzureMediaServices(default_credential, subscription_id)

# Create an input Asset
print(f"Creating input asset {in_asset_name}")
input_asset = client.assets.create_or_update(resource_group, account_name, in_asset_name, in_asset)

# An AMS asset is a container with a specific id that has "asset-" prepended to the GUID.
# So, you need to create the asset id to identify it as the container
# where Storage is to upload the video (as a block blob)
in_container = 'asset-' + input_asset.asset_id

# create an output Asset
print(f"Creating output asset {out_asset_name}")
output_asset = client.assets.create_or_update(resource_group, account_name, out_asset_name, out_asset)

### Use the Storage SDK to upload the video ###
print(f"Uploading the file {source_file}")
blob_service_client = BlobServiceClient.from_connection_string(os.getenv('STORAGEACCOUNTCONNECTION'))
blob_client = blob_service_client.get_blob_client(in_container, "ignite.mp4")

# working_dir = os.getcwd() + "\Media"
# print(working_dir)
# print(f"Current working directory: {working_dir}")
# upload_file_path = os.path.join(working_dir, source_file)
# print(upload_file_path,"####")

# WARNING: Depending on where you are launching the sample from, the path here could be off, and not include the BasicEncoding folder.
# Adjust the path as needed depending on how you are launching this python sample file.

# Upload the video to storage as a block blob
#with open(url, "rb") as data:
blob_client.upload_blob_from_url(source_file)

transform_name = 'ContentAwareEncodingAssetFilters'

# Create a new Standard encoding Transform for Built-in Copy Codec
print(f"Creating Encoding transform named: {transform_name}")

# For this snippet, we are using 'BuiltInStandardEncoderPreset'
transform_output = TransformOutput(
    preset=BuiltInStandardEncoderPreset(
        preset_name="ContentAwareEncoding"
    ),
    # What should we do with the job if there is an error?
    on_error=OnErrorType.STOP_PROCESSING_JOB,
    # What is the relative priority of this job to others? Normal, high or low?
    relative_priority=Priority.NORMAL
)

print("Creating encoding transform...")

# Adding transform details
my_transform = Transform()
my_transform.description="Transform with Asset filters"
my_transform.outputs = [transform_output]

print(f"Creating transform {transform_name}")
transform = client.transforms.create_or_update(
    resource_group_name=resource_group,
    account_name=account_name,
    transform_name=transform_name,
    parameters=my_transform)
print(f"{transform_name} created (or updated if it existed already). ")

job_name = 'ContentAwareEncodingAssetFilters'+ uniqueness
print(f"Creating custom encoding job {job_name}")
files = (source_file)

# Create Job Input and Output Assets
input = JobInputAsset(asset_name=in_asset_name)
outputs = JobOutputAsset(asset_name=out_asset_name)

# Create the job object and then create transform job
the_job = Job(input=input, outputs=[outputs])
job: Job = client.jobs.create(resource_group, account_name, transform_name, job_name, parameters=the_job)

# Check job state
job_state = client.jobs.get(resource_group, account_name, transform_name, job_name)
# First check
print("First job check")
print(job_state.state)

# Check the state of the job every 10 seconds. Adjust time_in_seconds = <how often you want to check for job state>
def countdown(t):
    while t:
        mins, secs = divmod(t, 60)
        timer = '{:02d}:{:02d}'.format(mins, secs)
        print(timer, end="\r")
        time.sleep(1)
        t -= 1
    job_current = client.jobs.get(resource_group, account_name, transform_name, job_name)
    if(job_current.state == "Finished"):
        print(job_current.state)
        # TODO: Download the output file using blob storage SDK
        return
    if(job_current.state == "Error"):
        print(job_current.state)
        # TODO: Provide Error details from Job through API
        return
    else:
        print(job_current.state)
        countdown(int(time_in_seconds))

time_in_seconds = 10
countdown(int(time_in_seconds))

print(f"Creating locator for streaming...")

# Publish the output asset for streaming via HLS or DASH
locator_name = f"locator-{uniqueness}"

# Create the Asset filters
print("Creating an asset filter...")
asset_filter_name = 'filter1'

# Create the asset filter
asset_filter = client.asset_filters.create_or_update(
    resource_group_name=resource_group,
    account_name=account_name,
    asset_name=out_asset_name,
    filter_name=asset_filter_name,
    parameters=AssetFilter(
        # In this sample, we are going to filter the manifest by the time range of the presentation using the default timescale.
        # You can adjust these settings for your own needs. Note that you can also control output tracks, and quality levels with a filter.
        tracks=[],
        # start_timestamp = 100000000 and end_timestamp = 300000000 using the default timescale will generate
        # a play-list that contains fragments from between 10 seconds and 30 seconds of the VoD presentation.
        # If a fragment straddles the boundary, the entire fragment will be included in the manifest.
        presentation_time_range=PresentationTimeRange(start_timestamp=100000000, end_timestamp=300000000)
    )
)

if asset_filter:
    print(f"The asset filter ({asset_filter_name}) was successfully created.")
    print()
else:
    raise ValueError("There was an issue creating the asset filter.")

if output_asset:
    streaming_locator = StreamingLocator(asset_name=out_asset_name, streaming_policy_name="Predefined_DownloadAndClearStreaming", filters=list(asset_filter_name.split(" ")))
    locator = client.streaming_locators.create(
        resource_group_name=resource_group,
        account_name=account_name,
        streaming_locator_name=locator_name,
        parameters=streaming_locator
    )
    if locator:
        print(f"The streaming locator {locator_name} was successfully created!")
    else:
        raise Exception(f"Error while creating streaming locator {locator_name}")

    if locator.name:
        hls_format = "format=m3u8-cmaf"
        dash_format = "format=mpd-time-cmaf"

        # Get the default streaming endpoint on the account
        streaming_endpoint = client.streaming_endpoints.get(
            resource_group_name=resource_group,
            account_name=account_name,
            streaming_endpoint_name=streaming_endpoint_name
        )
        if streaming_endpoint.resource_state != "Running":
            print(f"Streaming endpoint is stopped. Starting endpoint named {streaming_endpoint_name}")
            client.streaming_endpoints.begin_start(resource_group, account_name, streaming_endpoint_name)

        basename_tup = os.path.splitext(source_file)  # Extracting the filename and extension
        path_extension = basename_tup[1]  # Setting extension of the path
        manifest_name = os.path.basename(source_file).replace(path_extension, "")
        print(f"The manifest name is: {manifest_name}")
        manifest_base = f"https://{streaming_endpoint.host_name}/{locator.streaming_locator_id}/{manifest_name}.ism/manifest"

        hls_manifest = ""
        if asset_filter_name is None:
            hls_manifest = f'{manifest_base}({hls_format})'
        else:
            hls_manifest = f'{manifest_base}({hls_format},filter={asset_filter_name})'
        print(f"The HLS (MP4) manifest URL is: {hls_manifest}")
        print("Open the following URL to playback the live stream in an HLS compliant player (HLS.js, Shaka, ExoPlayer) or directly in an iOS device")
        print({hls_manifest})
        print()

        dash_manifest = ""
        if asset_filter_name is None:
            dash_manifest = f'{manifest_base}({dash_format})'
        else:
            dash_manifest = f'{manifest_base}({dash_format},filter={asset_filter_name})'
        print(f"The DASH manifest URL is: {dash_manifest}")
        print("Open the following URL to playback the live stream from the LiveOutput in the Azure Media Player")
        print(f"https://ampdemo.azureedge.net/?url={dash_manifest}&heuristicprofile=lowlatency")
        print()
    else:
        raise ValueError("Locator was not created or Locator name is undefined.")
There's an example of how to use playback speed at https://amp.azure.net/libs/amp/latest/samples/dynamic_playback_speed.html. It is also available at https://github.com/Azure-Samples/azure-media-player-samples/blob/master/html/dynamic_playback_speed.html.

Using Boto3 (resource method) to create image - can't add tags to snapshots?

I have this script:
#!/usr/bin/env python3
import boto3
import argparse
import time

ec2 = boto3.resource('ec2')
dstamp = time.strftime("_%m-%d-%Y-0")

parser = argparse.ArgumentParser(description='Create Image(AMI) from Instance tag:Name Value')
parser.add_argument('names', nargs='+', type=str.upper, help='instance name or list of names to create images from')
args = parser.parse_args()

# List Source Instances for Image/Tag Creation
for instance in ec2.instances.all():
    # Pull Name tags from source instances
    for name in instance.tags:
        if name["Key"] == 'Name':
            instancename = name["Value"]
    # Check for Name Tag Match with args
    for iname in args.names:
        if iname == instancename:
            # Create an image if we have a match
            image = instance.create_image(
                Description=f"Created from Source: {instance.id} Name: {instancename}",
                Name=instancename + dstamp,
                NoReboot=True)
            print('New: {} from instance.id: {} {}'.format(image.id, instance.id, instancename))
            # ----------------------------------------------
            # Can't copy tags from the source instance as-is because of tags auto-generated by CloudFormation.
            # Error I got: "Tag keys starting with 'aws:' are reserved for internal use"
            # So we skip any tag whose Key starts with 'aws:'
            # ----------------------------------------------
            for tag in instance.tags:
                dst_tags = []
                if tag['Key'].startswith('aws:'):
                    print("Skip tag that starts with 'aws:' " + tag['Key'])
                else:
                    dst_tags.append(tag)
                print(' Tags:', dst_tags)
                image.create_tags(Tags=dst_tags)
This is working perfectly, but the final function I am missing is to apply the tags to the underlying volume snapshots within the newly created image. Do I have to totally switch to client = boto3.client('ec2') in order to tag my volume snapshots?
To put it another way - how are people who are using images for backup tagging their volume snapshots?
I have been working with boto3 and Python 3 for all of three weeks, alongside my regular duties, so any help would be appreciated.
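For what it's worth, one way this could be done without switching everything to boto3.client('ec2') is to wait for the image, read its block device mappings, and tag each snapshot through the Snapshot resource. A rough sketch (untested; it assumes image, instance, and the ec2 resource from the script above):

# Wait until the image is available so its block device mappings contain snapshot IDs.
ec2.meta.client.get_waiter("image_available").wait(ImageIds=[image.id])
image.reload()

# Reuse the same filtering rule as above: drop reserved 'aws:' tags.
dst_tags = [t for t in instance.tags if not t["Key"].startswith("aws:")]

for mapping in image.block_device_mappings:
    snapshot_id = mapping.get("Ebs", {}).get("SnapshotId")
    if snapshot_id:
        ec2.Snapshot(snapshot_id).create_tags(Tags=dst_tags)
        print(f"Tagged snapshot {snapshot_id} from {image.id}")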

Python script - Blogger2Wordpress - how to save file?

I use the blogger2wordpress Python script that Google released back in 2010 (https://code.google.com/archive/p/google-blog-converters-appengine/downloads) to convert a 95 MB Blogger export file to WordPress WXR format.
However, the script has this code:
#!/usr/bin/env python

# Copyright 2008 Google Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0.txt
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import os.path
import logging
import re
import sys
import time

from xml.sax.saxutils import unescape

import BeautifulSoup
import gdata
from gdata import atom
import iso8601
import wordpress

__author__ = 'JJ Lueck (EMAIL#gmail.com)'

###########################
# Constants
###########################

BLOGGER_URL = 'http://www.blogger.com/'
BLOGGER_NS = 'http://www.blogger.com/atom/ns#'
KIND_SCHEME = 'http://schemas.google.com/g/2005#kind'

YOUTUBE_RE = re.compile('http://www.youtube.com/v/([^&]+)&?.*')
YOUTUBE_FMT = r'[youtube=http://www.youtube.com/watch?v=\1]'
GOOGLEVIDEO_RE = re.compile('(http://video.google.com/googleplayer.swf.*)')
GOOGLEVIDEO_FMT = r'[googlevideo=\1]'
DAILYMOTION_RE = re.compile('http://www.dailymotion.com/swf/(.*)')
DAILYMOTION_FMT = r'[dailymotion id=\1]'

###########################
# Translation class
###########################


class Blogger2Wordpress(object):
  """Performs the translation of a Blogger export document to WordPress WXR."""

  def __init__(self, doc):
    """Constructs a translator for a Blogger export file.

    Args:
      doc: The WXR file as a string
    """
    # Ensure UTF8 chars get through correctly by ensuring we have a
    # compliant UTF8 input doc.
    self.doc = doc.decode('utf-8', 'replace').encode('utf-8')

    # Read the incoming document as a GData Atom feed.
    self.feed = atom.FeedFromString(self.doc)
    self.next_id = 1

  def Translate(self):
    """Performs the actual translation to WordPress WXR export format.

    Returns:
      A WordPress WXR export document as a string, or None on error.
    """
    # Create the top-level document and the channel associated with it.
    channel = wordpress.Channel(
        title = self.feed.title.text,
        link = self.feed.GetAlternateLink().href,
        base_blog_url = self.feed.GetAlternateLink().href,
        pubDate = self._ConvertPubDate(self.feed.updated.text))

    posts_map = {}

    for entry in self.feed.entry:
      # Grab the information about the entry kind
      entry_kind = ""
      for category in entry.category:
        if category.scheme == KIND_SCHEME:
          entry_kind = category.term

      if entry_kind.endswith("#comment"):
        # This entry will be a comment, grab the post that it goes to
        in_reply_to = entry.FindExtensions('in-reply-to')
        post_item = None
        # Check to see that the comment has a corresponding post entry
        if in_reply_to:
          post_id = self._ParsePostId(in_reply_to[0].attributes['ref'])
          post_item = posts_map.get(post_id, None)

        # Found the post for the comment, add the comment to it
        if post_item:
          # The author email may not be included in the file
          author_email = ''
          if entry.author[0].email:
            author_email = entry.author[0].email.text

          # Same for the author's url
          author_url = ''
          if entry.author[0].uri:
            author_url = entry.author[0].uri.text

          post_item.comments.append(wordpress.Comment(
              comment_id = self._GetNextId(),
              author = entry.author[0].name.text,
              author_email = author_email,
              author_url = author_url,
              date = self._ConvertDate(entry.published.text),
              content = self._ConvertContent(entry.content.text)))

      elif entry_kind.endswith('#post'):
        # This entry will be a post
        post_item = self._ConvertEntry(entry, False)
        posts_map[self._ParsePostId(entry.id.text)] = post_item
        channel.items.append(post_item)

      elif entry_kind.endswith('#page'):
        # This entry will be a static page
        page_item = self._ConvertEntry(entry, True)
        posts_map[self._ParsePageId(entry.id.text)] = page_item
        channel.items.append(page_item)

    wxr = wordpress.WordPressWxr(channel=channel)
    return wxr.WriteXml()

  def _ConvertEntry(self, entry, is_page):
    """Converts the contents of an Atom entry into a WXR post Item element."""
    # A post may have an empty title, in which case the text element is None.
    title = ''
    if entry.title.text:
      title = entry.title.text

    # Check here to see if the entry points to a draft or regular post
    status = 'publish'
    if entry.control and entry.control.draft:
      status = 'draft'

    # If no link is present in the Blogger entry, just link
    if entry.GetAlternateLink():
      link = entry.GetAlternateLink().href
    else:
      link = BLOGGER_URL

    # Declare whether this is a post or a page
    post_type = 'post'
    if is_page:
      post_type = 'page'

    blogger_blog = ''
    blogger_permalink = ''
    if entry.GetAlternateLink():
      blogger_path_full = entry.GetAlternateLink().href.replace('http://', '')
      blogger_blog = blogger_path_full.split('/')[0]
      blogger_permalink = blogger_path_full[len(blogger_blog):]

    # Create the actual item element
    post_item = wordpress.Item(
        title = title,
        link = link,
        pubDate = self._ConvertPubDate(entry.published.text),
        creator = entry.author[0].name.text,
        content = self._ConvertContent(entry.content.text),
        post_id = self._GetNextId(),
        post_date = self._ConvertDate(entry.published.text),
        status = status,
        post_type = post_type,
        blogger_blog = blogger_blog,
        blogger_permalink = blogger_permalink,
        blogger_author = entry.author[0].name.text)

    # Convert the categories which specify labels into wordpress labels
    for category in entry.category:
      if category.scheme == BLOGGER_NS:
        post_item.labels.append(category.term)

    return post_item

  def _ConvertContent(self, text):
    """Unescapes the post/comment text body and replaces video content.

    All <object> and <embed> tags in the post that relate to video must be
    changed into the WordPress tags for embedding video,
    e.g. [youtube=http://www.youtube.com/...]

    If no text is provided, the empty string is returned.
    """
    if not text:
      return ''

    # First unescape all XML tags as they'll be escaped by the XML emitter
    content = unescape(text)

    # Use an HTML parser on the body to look for video content
    content_tree = BeautifulSoup.BeautifulSoup(content)

    # Find the object tag
    objs = content_tree.findAll('object')
    for obj_tag in objs:
      # Find the param tag within which contains the URL to the movie
      param_tag = obj_tag.find('param', { 'name': 'movie' })
      if not param_tag:
        continue

      # Get the video URL
      video = param_tag.attrMap.get('value', None)
      if not video:
        continue

      # Convert the video URL if necessary
      video = YOUTUBE_RE.subn(YOUTUBE_FMT, video)[0]
      video = GOOGLEVIDEO_RE.subn(GOOGLEVIDEO_FMT, video)[0]
      video = DAILYMOTION_RE.subn(DAILYMOTION_FMT, video)[0]

      # Replace the portion of the contents with the video
      obj_tag.replaceWith(video)

    return str(content_tree)

  def _ConvertPubDate(self, date):
    """Translates to a pubDate element's time/date format."""
    date_tuple = iso8601.parse_date(date)
    return date_tuple.strftime('%a, %d %b %Y %H:%M:%S %z')

  def _ConvertDate(self, date):
    """Translates to a wordpress date element's time/date format."""
    date_tuple = iso8601.parse_date(date)
    return date_tuple.strftime('%Y-%m-%d %H:%M:%S')

  def _GetNextId(self):
    """Returns the next identifier to use in the export document as a string."""
    next_id = self.next_id
    self.next_id += 1
    return str(next_id)

  def _ParsePostId(self, text):
    """Extracts the post identifier from a Blogger entry ID."""
    matcher = re.compile('post-(\d+)')
    matches = matcher.search(text)
    return matches.group(1)

  def _ParsePageId(self, text):
    """Extracts the page identifier from a Blogger entry ID."""
    matcher = re.compile('page-(\d+)')
    matches = matcher.search(text)
    return matches.group(1)


if __name__ == '__main__':
  if len(sys.argv) <= 1:
    print 'Usage: %s <blogger_export_file>' % os.path.basename(sys.argv[0])
    print
    print ' Outputs the converted WordPress export file to standard out.'
    sys.exit(-1)

  wp_xml_file = open(sys.argv[1])
  wp_xml_doc = wp_xml_file.read()
  translator = Blogger2Wordpress(wp_xml_doc)
  print translator.Translate()
  wp_xml_file.close()
This script outputs the WXR file to the terminal window, which is not practical when the export file has tons of entries.
As I am not familiar with Python, how can I modify the script to output the data to an .xml file?
Edit:
I changed the end of the script to:
wp_xml_file = open(sys.argv[1])
wp_xml_doc = wp_xml_file.read()
translator = Blogger2Wordpress(wp_xml_doc)
print translator.Translate()
fh = open("testoutput.xml", "w")
fh.write(wp_xml_doc);
fh.close();
wp_xml_file.close()
But the produced file is an "invalid wxr file" :/
Can anybody help? Thanks!
Quick and dirty answer:
Output to stdout is normal behaviour.
You might want to redirect it to a file, for instance:
python2 blogger2wordpress your_blogger_export_file > backup
The output will be saved in the file named backup.
Or you can replace print translator.Translate() by
with open('output_file', 'w') as fd:
    fd.write(translator.Translate())
This should do the trick (haven't tried).
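Note also that the edit in the question writes wp_xml_doc (the original Blogger export) to testoutput.xml rather than the translated output, which would explain the "invalid wxr file" result. A corrected ending would look roughly like this (sketch, untested):

wp_xml_file = open(sys.argv[1])
wp_xml_doc = wp_xml_file.read()
translator = Blogger2Wordpress(wp_xml_doc)

# Write the translated WXR output, not the original Blogger export.
with open("testoutput.xml", "w") as fh:
  fh.write(translator.Translate())

wp_xml_file.close()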

transferring rdf to 4store

I have a script named rdf.py that generates RDF. What I want to do is move that output directly into 4store. I have stored the generated RDF in a variable and want to pass that variable directly to 4store. Is that possible?
The code of rdf.py is below.
rdf_code contains the entire RDF code that is generated.
import rdflib
from rdflib.events import Dispatcher, Event
from rdflib.graph import ConjunctiveGraph as Graph
from rdflib import plugin
from rdflib.store import Store, NO_STORE, VALID_STORE
from rdflib.namespace import Namespace
from rdflib.term import Literal
from rdflib.term import URIRef
from tempfile import mkdtemp
from gstudio.models import *
from objectapp.models import *
from reversion.models import Version
from optparse import make_option


def get_nodetype(name):
    """
    returns the model the id belongs to.
    """
    try:
        """
        ALGO: get object id, go to version model, return for the given id.
        """
        node = NID.objects.get(title=str(name))
        # Retrieving only the relevant tupleset for the versioned objects
        vrs = Version.objects.filter(type=0, object_id=node.id)
        # Returned value is a list, so splice it .
        vrs = vrs[0]
    except Error:
        return "The item was not found."
    return vrs.object._meta.module_name


def rdf_description(name, notation='xml'):
    """
    Function takes title of node, and rdf notation.
    """
    valid_formats = ["xml", "n3", "ntriples", "trix"]
    default_graph_uri = "http://gstudio.gnowledge.org/rdfstore"
    configString = "/var/tmp/rdfstore"

    # Get the Sleepycat plugin.
    store = plugin.get('IOMemory', Store)('rdfstore')

    # Open previously created store, or create it if it doesn't exist yet
    graph = Graph(store="IOMemory",
                  identifier=URIRef(default_graph_uri))
    path = mkdtemp()
    rt = graph.open(path, create=False)
    if rt == NO_STORE:
        # There is no underlying Sleepycat infrastructure, create it
        graph.open(path, create=True)
    else:
        assert rt == VALID_STORE, "The underlying store is corrupt"

    # Now we'll add some triples to the graph & commit the changes
    # rdflib = Namespace('http://sbox.gnowledge.org/gstudio/')
    graph.bind("gstudio", "http://gnowledge.org/")
    exclusion_fields = ["id", "rght", "node_ptr_id", "image", "lft", "_state", "_altnames_cache", "_tags_cache", "nid_ptr_id", "_mptt_cached_fields"]
    node_type = get_nodetype(name)
    if (node_type=='gbobject'):
        node = Gbobject.objects.get(title=name)
    elif (node_type=='objecttype'):
        node = Objecttype.objects.get(title=name)
    elif (node_type=='metatype'):
        node = Metatype.objects.get(title=name)
    elif (node_type=='attributetype'):
        node = Attributetype.objects.get(title=name)
    elif (node_type=='relationtype'):
        node = Relationtype.objects.get(title=name)
    elif (node_type=='attribute'):
        node = Attribute.objects.get(title=name)
    elif (node_type=='complement'):
        node = Complement.objects.get(title=name)
    elif (node_type=='union'):
        node = Union.objects.get(title=name)
    elif (node_type=='intersection'):
        node = Intersection.objects.get(title=name)
    elif (node_type=='expression'):
        node = Expression.objects.get(title=name)
    elif (node_type=='processtype'):
        node = Processtype.objects.get(title=name)
    elif (node_type=='systemtype'):
        node = Systemtype.objects.get(title=name)

    node_url = node.get_absolute_url()
    site_add = node.sites.all()
    a = site_add[0]
    host_name = a.name
    # host_name=name
    link = 'http://'

    # Concatenating the above variables will give the url address.
    url_add = link + host_name + node_url
    rdflib = Namespace(url_add)
    # node=Objecttype.objects.get(title=name)

    node_dict = node.__dict__
    subject = str(node_dict['id'])
    for key in node_dict:
        if key not in exclusion_fields:
            predicate = str(key)
            pobject = str(node_dict[predicate])
            graph.add((rdflib[subject], rdflib[predicate], Literal(pobject)))

    rdf_code = graph.serialize(format=notation)

    # print out all the triples in the graph
    for subject, predicate, object in graph:
        print subject, predicate, object

    graph.commit()
    print rdf_code
    graph.close()
Can I directly pass rdf_code to 4store? If yes, then how?
The simplest way to do this is to transform that graph into N-Triples and send it to http://yourhost:port/data/GRAPH_URI. If you do an HTTP POST, the triples will be appended to the existing graph represented by GRAPH_URI. If you do an HTTP PUT, the current graph will be replaced. If the graph does not exist, it will be created whether you POST or PUT.
Taking this function as an example (imports added for completeness):
import urllib
import urllib2


def assert4s(data, epr, graph, contenttype, flush=False):
    try:
        params = urllib.urlencode({'graph': graph,
                                   'data': data,
                                   'mime-type': contenttype})
        opener = urllib2.build_opener(urllib2.HTTPHandler)
        request = urllib2.Request(epr, params)
        request.get_method = lambda: ('PUT' if flush else 'POST')
        url = opener.open(request)
        return url.read()
    except Exception, e:
        raise e
If you had the following data:
triples = """<a> <b> <c> .
<d> <e> <f> .
"""
You can do the following call:
assert4s(triples,
         "http://yourhost:port/data/",
         "http://some.org/graph/id",
         "application/x-turtle")
Edit
My previous answer assumed you were using the 4s-httpd server. You can start the SPARQL server in 4store with the following command: 4s-httpd -p PORT kb_name. Once you have it running, you can use the following services:
http://localhost:port/sparql/ to submit queries
http://localhost:port/data/ to PUT or POST data files
http://localhost:port/update/ to submit SPARQL update queries
The 4store SPARQLServer documentation is quite complete.
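Tying this back to the question's rdf.py, a rough sketch (assuming the assert4s helper above and a 4s-httpd instance on a hypothetical localhost:8000) would be to serialize the graph as N-Triples and POST it:

# Serialize the graph built in rdf_description() and push it to 4store.
rdf_code = graph.serialize(format='nt')

assert4s(rdf_code,
         "http://localhost:8000/data/",            # 4s-httpd data endpoint
         "http://gstudio.gnowledge.org/rdfstore",  # graph URI used in rdf.py
         "application/x-turtle")                   # N-Triples is also valid Turtle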
