Query S3 from Python - python

I am using python to send a query to Athena and get table DDL. I am using start_query_execution and get_query_execution functions in the awswrangler package.
import boto3
import awswrangler as wr
import time
import pandas as pd
boto3.setup_default_session(region_name="us-east-1")
sql="show create table 'table-name'"
query_exec_id = wr.athena.start_query_execution(sql=sql, database='database-name')
time.sleep(20)
res=wr.athena.get_query_execution(query_execution_id=query_exec_id)
The code above creates a dict object that stores query results in an s3 link.
The link can be accessed by
res['ResultConfiguration']['OutputLocation']. It's a text link: s3://.....txt
Can someone help me figure how to access the output in the link. I tried using readlines() but it seemes to error out.
Here is what I did
import urllib3
target_url = res['ResultConfiguration']['OutputLocation']
f = urllib3.urlopen(target_url)
for l in f.readlines():
print (l)
Or if someone can suggest an easier way to get table DDL in python.

Keep in mind that the returned link will time out after a short while... and make sure your credentials allow you to get the data from the URL specified. If you drop the error message here we can help you better. –
Oh... "It's a text link: s3://.....txt" is not a standard URL. You cant read that with urllib3. You can use awswrangler to read the bucket. –
I think the form is
wr.s3.read_fwf(...)

Related

Tableau embedded data source refresh using python

Is there a way to refresh Tableau embedded datasource using python. I am currently using Tableau server client library to refresh published datasources which is actually working fine. Can someone help me to figure out a way?
The way you can reach them is kinda annoying from my perpective.
You need to use populate_connections() function to load embedded datasources. It would be easier if you know the name of the workbook.
import tableauserverclient as TSC
#sign in using personal access token
server = TSC.Server(server_address='server_name', use_server_version=True)
server.auth.sign_in_with_personal_access_token(auth_req=TSC.PersonalAccessTokenAuth(token_name='tokenName', personal_access_token='tokenValue', site_id='site_name'))
#use RequestOptions() with a filter to pull an specific workbook
def get_workbook(name):
req_opt = TSC.RequestOptions()
req_opt.filter.add(TSC.Filter(req_opt.Field.Name, req_opt.Operator.Equals, name))
return server.workbooks.get(req_opt)[0][0] #workbooks.get () function is intended to return a list items that you can iterate, but here we are assuming it will be find only one result
workbook = get_workbook(name='workbook_name') #gets the workbook
server.workbooks.populate_connections(workbook) #this function will load all the embedded datasources in the workbook
for datasource in workbook.connections: #iterate in datasource list
#Note: each element of this list is not an TSC.DatabaseItem, so, you will need to load a valid one using the "datasource_id" attribute from the element.
#If you try server.datasources.refresh(datasource) it will fail
ds = server.datasources.get_by_id(datasource.datasource_id) #loads a valid TSC.DatabaseItem
server.datasources.refresh(ds) #finally, you will be able to refresh it
...
The best practice is do not embeddeding datasources but publish them independently.
Update:
There is an easy way to achieve this. There are two types of extract tasks, Workbook and Data source. So, for embedded data sources, you need to perform a workbook refresh.
workbook = get_workbook(name='workbook_name')
server.workbooks.refresh(workbook.id)
You can use "tableauserverclient" Python package. You can pip install it from PyPy.
After installing it, you can consult the docs.
I will attach an example I used some time ago:
import tableauserverclient as TSC
tableau_auth = TSC.TableauAuth('user', 'pass', 'homepage')
server = TSC.Server('server')
with server.auth.sign_in(tableau_auth):
all_datasources, pagination_item = server.datasources.get()
print("\nThere are {} datasources on
site:".format(pagination_item.total_available))
print([datasource.name for datasource in all_datasources])

Google search for address using python

How to get the address of any hotel,restaurant,place,mall using python?
I have already used geopy package which works for some specific places but not for all.Is there any other way out.
Google has geocoding services but they pass data using JSON responses. You will have to parse the JSON schema.
import requests
url = 'https://maps.googleapis.com/maps/api/geocode/json'
p = {'address:' 'New York'}
r = requests.get(url, params=p).json()
results = r['results']
results will hold your location. You simply have to retrieve what's needed. Hope that helps.
If there are any problems let me know

Azure blob storage to JSON in azure function using SDK

I am trying to create a timer trigger azure function that takes data from blob, aggregates it, and puts the aggregates in a cosmosDB. I previously tried using the bindings in azure functions to use blob as input, which I was informed was incorrect (see this thread: Azure functions python no value for named parameter).
I am now using the SDK and am running into the following problem:
import sys, os.path
sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), 'myenv/Lib/site-packages')))
import json
import pandas as pd
from azure.storage.blob import BlockBlobService
data = BlockBlobService(account_name='accountname', account_key='accountkey')
container_name = ('container')
generator = data.list_blobs(container_name)
for blob in generator:
print("{}".format(blob.name))
json = json.loads(data.get_blob_to_text('container', open(blob.name)))
df = pd.io.json.json_normalize(json)
print(df)
This results in an error:
IOError: [Errno 2] No such file or directory: 'test.json'
I realize this might be an absolute path issue, but im not sure how that works with azure storage. Any ideas on how to circumvent this?
Made it "work" by doing the following:
for blob in generator:
loader = data.get_blob_to_text('kvaedevdystreamanablob',blob.name,if_modified_since=delta)
json = json.loads(loader.content)
This works for ONE json file, i.e I only had one in storage, but when more are added I get this error:
ValueError: Expecting object: line 1 column 21907 (char 21906)
This happens even if i add if_modified_since as to only take in one blob. Will update if I figure something out. Help always welcome.
Another update: My data is coming in through stream analytics, and then down to the blob. I have selected that the data should come in as arrays, this is why the error is occurring. When the stream is terminated, the blob doesnt immediately append ] to the EOF line in json, thus the json file isnt valid. Will try now with using line-by-line in stream analytics instead of array.
figured it out. In the end it was a quite simple fix:
I had to make sure each json entry in the blob was less than 1024 characters, or it would create a new line, thus making reading lines problematic.
The code that iterates through each blob file, reads and adds to a list is a follows:
data = BlockBlobService(account_name='accname', account_key='key')
generator = data.list_blobs('collection')
dataloaded = []
for blob in generator:
loader = data.get_blob_to_text('collection',blob.name)
trackerstatusobjects = loader.content.split('\n')
for trackerstatusobject in trackerstatusobjects:
dataloaded.append(json.loads(trackerstatusobject))
From this you can add to a dataframe and do what ever you want :)
Hope this helps if someone stumbles upon a similar problem.

How to get more insight into why documents fail to be ingested in Watson Discovery Service

I'm using the DiscoveryV1 module of the watson_developer_cloud python library to ingest 700+ documents into a WDS collection. Each time I attempt a bulk-ingestion many of the documents fail to be ingested, it is nondeterministic, usually around 100 documents fail.
Each time I call discovery.add_document(env_id, cold_id, file_info=file_info) I find that the response contains a WDS document_id. After I've made this call for all documents in my corpus I use the corresponding document_ids to call discovery.get_document(env_id, col_id, doc_id) and check the document's status. Around 100 of these calls will return the status Document failed to be ingested and indexed. There is no pattern among the files that fail, they range in size and of both msword (doc) and pdf file types.
My code to ingest a document was written based on the WDS Documentation, it looks something like this:
with open(f_path) as file_data:
if f_path.endswith('.doc') or f_path.endswith('.docx'):
re = discovery.add_document(env_id, col_id, file_info=file_data, mime_type='application/msword')
else:
re = discovery.add_document(env_id, col_id, file_info=file_data)
Because my corpus is relatively large, ~3gb in size, I recieve Service is busy processing... responses from discovery.add_document(env_id, cold_id, file_info=file_info) calls in which case I call sleep(5) and try again.
I've exhausted the WDS documentation without any luck. How can I get more insight into the reason that these files are failing to be ingested?
You should be able to use the https://watson-api-explorer.mybluemix.net/apis/discovery-v1#!/Queries/queryNotices API to see errors/warnings that happen during ingestion along with details that might give more information on why the ingestion failed.
Unfortunately, at the time of this posting it does not look like the python SDK has a method to wrap this API yet, so you can use the Watson Discovery Tooling or use curl to query the API directly (replacing the values in {} with your collection-specific values)
curl -u "{username}:{password}" "https://gateway.watsonplatform.net/discovery/api/v1/environments/{environment_id}/collections/{collection_id}/notices?version=2017-01-01
The python-sdk now supports querying notices.
from watson_developer_cloud import DiscoveryV1
discovery = DiscoveryV1(
version='2017-10-16',
## url is optional, and defaults to the URL below. Use the correct URL for your region.
url='https://gateway.watsonplatform.net/discovery/api',
iam_api_key='your_api_key')
discovery.federated_query_notices('env_id', ['collection_id']])

Mapping Resolution & fixVersion received via Python JIRA REST API to human-readable values

I need to pull information on a long list of JIRA issues that live in a CSV file. I'm using the JIRA REST API in Python in a small script to see what kind of data I can expect to retrieve:
#!/usr/bin/python
import csv
import sys
from jira.client import JIRA
*...redacted*
csvfile = list(csv.reader(open(sys.argv[1])))
for row in csvfile:
r = str(row).strip("'[]'")
i = jira.issue(r)
print i.id,i.fields.summary,i.fields.fixVersions,i.fields.resolution,i.fields.resolutiondate
The ID (Key), Summary, and Resolution dates are human-readable as expected. The fixVersions and Resolution fields are resources as follows:
[<jira.resources.Version object at 0x105096b11>], <jira.resources.Resolution object at 0x105096d91>
How do I use the API to get the set of available fixVersions and Resolutions, so that I can populate this correctly in my output CSV?
I understand how JIRA stores these values, but the documentation on the jira-python code doesn't explain how to harness it to grab those base values. I'd be happy to just snag the available fixVersion and Resolution values globally, but the resource info I receive doesn't map to them in an obvious way.
You can use fixVersion.name and resolution.name to get the string versions of those values.
User mdoar answered this question in his comment:
How about using version.name and resolution.name?

Categories

Resources