Python code in AWS Lambda to load Shapefile to PostGIS(RDS) - python

I am new to GIS implementation. I am trying to develop AWS Lambda code in Python to load a shapefile dynamically.
I developed the code after doing some research, and it works perfectly on my local machine.
But the same code gives me trouble when I try to run it in AWS Lambda.
I have added libraries (Lambda Layers) for 'OSGEO/GDAL' in AWS Lambda and tested them by importing the modules, which works fine.
Following is the code:
import os
import subprocess
import boto3
import urllib.parse
from osgeo import gdal
from osgeo import ogr

s3 = boto3.client('s3')

def lambda_handler(event, context):
    bucket = event['Records'][0]['s3']['bucket']['name']
    s3key = urllib.parse.unquote_plus(event['Records'][0]['s3']['object']['key'], encoding='utf-8')

    # input shapefile
    input_shp = ('s3://' + bucket + '/' + s3key)

    # database options
    db_schema = "SCHEMA=public"
    overwrite_option = "OVERWRITE=YES"
    geom_type = "MULTILINESTRING"
    output_format = "PostgreSQL"

    # database connection string
    db_connection = """PG:host=<RDS host name> port=5432 user=<RDS User Name> dbname=postgres password=<RDS Password>"""

    # call ogr2ogr from python
    subprocess.call(["ogr2ogr", "-lco", db_schema, "-lco", overwrite_option, "-nlt", geom_type, "-f", output_format, db_connection, input_shp])
The error message is:
[ERROR] FileNotFoundError: [Errno 2] No such file or directory: 'ogr2ogr': 'ogr2ogr'
The same code works fine on my local machine, with the only difference being that instead of S3 I provide a hard-coded path to where the shapefile is stored locally.
Any suggestions?
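In case it helps others reading this, a possible workaround is to skip the external ogr2ogr binary entirely and use the osgeo Python bindings that the Lambda layer already exposes. The sketch below is only an outline under that assumption: it uses gdal.VectorTranslate (the programmatic equivalent of ogr2ogr) and GDAL's /vsis3/ virtual filesystem, which requires a GDAL build with curl/S3 support; the connection details are placeholders. If /vsis3/ is not available in the layer, the shapefile and its sidecar files (.shx, .dbf, .prj) could be copied to /tmp with boto3 first.

from osgeo import gdal

gdal.UseExceptions()

def convert_with_bindings(bucket, s3key):
    # Read the shapefile straight from S3 via GDAL's virtual filesystem
    # (assumes the Lambda role's credentials are picked up from the environment).
    input_shp = '/vsis3/' + bucket + '/' + s3key

    # Placeholder connection string - fill in your RDS details.
    db_connection = "PG:host=<RDS host name> port=5432 user=<RDS user name> dbname=postgres password=<RDS password>"

    # Programmatic equivalent of the ogr2ogr command line shown above.
    gdal.VectorTranslate(
        db_connection,
        input_shp,
        format='PostgreSQL',
        layerCreationOptions=['SCHEMA=public', 'OVERWRITE=YES'],
        geometryType='MULTILINESTRING',
    )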

Related

Azure Grab Data from Blob Storage w. Python (No downloading)

I'm trying to open a series of different cracked documents / texts that we've stored in Azure Blob Storage, ideally pushing them all into a pandas DataFrame. I do not want to download them (I'm going to be opening them from a Docker container); I just want to store the information in memory.
The file structure looks like: Azure Blob Storage -> MyContainer -> UUIDFolderNames (many) -> 1 "knowledge.json" file in each Folder.
What I've got working:
container = ContainerClient.from_connection_string( <my connection str>, <MyContainer> )
blob_list = container.list_blobs()
for blob in blob_list:
    blobClient = container.get_blob_client( blob )  # Not sure this is needed
Ideally, for each item in my for loop, I'd do something like opening the .json file, then adding its text to a row in my DataFrame. However, I can't actually manage to open any of the JSON files.
What I've tried:
#1
name = blob.name
json.loads( name )

#2
with open(name, 'r') as f:
    data = json.load( f )
Errors:
#1: JSONDecodeError: Expecting value: line 1 column 1 (char 0)
#2: No such file or directory
I've tried other sillier things like json.loads( blob ) or json.loads('knowledge.json') (no folder name in path), but those are kinda nonsensical things that I was just trying to see if they worked; they're not exactly reasonable.
Most methods (including those in Azure's documentation) download the file first, but again, I don't want to download the file.
*Edit: I realized that it's somewhat obvious why the files cannot be found: json.load etc. will look in my local directory (where I'm running the Python file from) rather than at the blob location. Still, I'm not sure how to load a file without downloading it.
With the help of the block below you will be able to view the JSON blob content:
for blobs in container_client.list_blobs():
    blob_client = service_client.get_blob_client(container=Container_name, blob=blobs)
    content = blob_client.download_blob()
    contentastext = content.readall()
    print(contentastext)
Below is the full code to read JSON files from blobs; later you can add this data to your DataFrames:
import os
import logging
import sys
import azure.functions as func
from azure.storage.blob import BlobServiceClient, BlobClient, ContainerClient, PublicAccess, __version__

def UploadFiles():
    CONNECTION_STRING = "ENTER_CONNECTION_STR"
    Container_name = "gatherblobs"
    service_client = BlobServiceClient.from_connection_string(CONNECTION_STRING)
    container_client = service_client.get_container_client(Container_name)
    for blobs in container_client.list_blobs():
        blob_client = service_client.get_blob_client(container=Container_name, blob=blobs)
        content = blob_client.download_blob()
        contentastext = content.readall()
        print(contentastext)

if __name__ == '__main__':
    UploadFiles()
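Since the original goal was to push each knowledge.json into a DataFrame row, a minimal follow-on sketch (assuming each blob holds a single JSON object and pandas is installed) could parse the downloaded bytes and collect them like this:

import json
import pandas as pd

records = []
for blob in container_client.list_blobs():
    blob_client = container_client.get_blob_client(blob)
    raw = blob_client.download_blob().readall()  # bytes, kept in memory only
    records.append(json.loads(raw))              # assumes one JSON object per blob

df = pd.DataFrame(records)  # one row per knowledge.json
print(df.head())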

Read an object from Alibaba OSS and modify it using pandas python

So, my data is in the format of CSV files in the OSS bucket of Alibaba Cloud.
I am currently executing a Python script, wherein:
1. I download the file onto my local machine.
2. Make the changes using a Python script on my local machine.
3. Store it in the AWS cloud.
I have to modify this method and schedule a cron job in Alibaba Cloud to automate the running of this script.
The Python script will be uploaded into Task Management of Alibaba Cloud.
So the new steps will be:
1. Read a file from the OSS bucket into pandas.
2. Modify it in pandas (merging it with other data, some column changes).
3. Store the modified file into AWS RDS.
I am stuck at the first step itself.
Error log:
"No module named" errors for oss2 and pandas.
What is the correct way of doing it?
This is a rough draft of my script (how I was able to execute the script on my local machine):
import os, re
import oss2              # throws an error: No module found
import datetime as dt
import pandas as pd      # throws an error: No module found
import tarfile
import mysql.connector
from datetime import datetime
from itertools import islice

dates = (dt.datetime.now() + dt.timedelta(days=-1)).strftime("%Y%m%d")

def download_file(access_key_id, access_key_secret, endpoint, bucket):
    # Authentication
    auth = oss2.Auth(access_key_id, access_key_secret)
    # Bucket name
    bucket = oss2.Bucket(auth, endpoint, bucket)
    # Download the file
    try:
        # List all objects in the fun folder and its subfolders.
        for obj in oss2.ObjectIterator(bucket, prefix=dates + 'order'):
            order_file = obj.key
            objectName = order_file.split('/')[1]
            df = pd.read_csv(bucket.get_object(order_file))  # to read into pandas
            # FUNCTION to modify and upload
        print("File downloaded")
    except:
        print("Pls check!!! File not read")
    return objectName
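The function works once the OSS object is read into an in-memory buffer before handing it to pandas; the corrected version below adds the io import and wraps the downloaded bytes in io.BytesIO: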
import os, re
import oss2
import datetime as dt
import pandas as pd
import tarfile
import mysql.connector
from datetime import datetime
from itertools import islice
import io  # include this new library

dates = (dt.datetime.now() + dt.timedelta(days=-1)).strftime("%Y%m%d")

def download_file(access_key_id, access_key_secret, endpoint, bucket):
    # Authentication
    auth = oss2.Auth(access_key_id, access_key_secret)
    # Bucket name
    bucket = oss2.Bucket(auth, endpoint, bucket)
    # Download the file
    try:
        # List all objects in the fun folder and its subfolders.
        for obj in oss2.ObjectIterator(bucket, prefix=dates + 'order'):
            order_file = obj.key
            objectName = order_file.split('/')[1]
            bucket_object = bucket.get_object(order_file).read()  # read the file from OSS
            img_buf = io.BytesIO(bucket_object)  # wrap the bytes in an in-memory buffer
            df = pd.read_csv(img_buf)  # to read into pandas
            # FUNCTION to modify and upload
        print("File downloaded")
    except:
        print("Pls check!!! File not read")
    return objectName
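The final step (writing the modified DataFrame into AWS RDS) is not covered above. As a rough sketch only, reusing the mysql.connector import already in the script and assuming a MySQL-compatible RDS instance, it could look like this; the endpoint, credentials, table name, and column names are hypothetical placeholders:

import mysql.connector

def upload_to_rds(df):
    # Hypothetical connection details - replace with your RDS endpoint and credentials.
    conn = mysql.connector.connect(
        host='<rds-endpoint>',
        user='<user>',
        password='<password>',
        database='<database>',
    )
    cursor = conn.cursor()
    # Hypothetical target table and columns matching the DataFrame.
    insert_sql = "INSERT INTO orders (order_id, amount) VALUES (%s, %s)"
    cursor.executemany(insert_sql, df[['order_id', 'amount']].values.tolist())
    conn.commit()
    cursor.close()
    conn.close()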

Read .odt and .doc File from url in python

How can I extract text from '.odt' and '.doc' format files from a URL using Python? I tried searching for it but couldn't find anything.
Any lead will be helpful.
from odf import text, teletype
from odf.opendocument import load

textdoc = load(r"C:\Users\OMS\Downloads\sample1.odt")
allparas = textdoc.getElementsByType(text.P)
for i in range(len(allparas)):
    a = teletype.extractText(allparas[i])
    print(a)
This works for a local .odt file, but now I need to extract text from a URL such as
"https://abc.s3.ap-south-1.amazonaws.com/sample1.odt"
Assume the connection to AWS S3 has been made using boto3.
The following is tested with Python 3.6 and with this test odt file:
import boto3
import io
from odf import text, teletype
from odf.opendocument import load

s3_client = boto3.resource('s3')  # TODO: change aws connection logic as per your setup

# TODO: refactor name, readability
def get_contents(file_name):
    obj = s3_client.Object('s3_bucket_name', file_name)  # TODO: change aws s3 bucket name as per your setup
    body = obj.get()['Body'].read()
    return load(io.BytesIO(body))

textdoc = get_contents("test.odt")  # TODO: change odt file name as per your setup
allparas = textdoc.getElementsByType(text.P)
for i in range(len(allparas)):
    a = teletype.extractText(allparas[i])
    print(a)
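If the file is only reachable through a plain HTTPS URL rather than through the S3 API, a similar in-memory approach should work; this is a sketch assuming the requests package is available and the URL is publicly readable:

import io
import requests
from odf import text, teletype
from odf.opendocument import load

url = "https://abc.s3.ap-south-1.amazonaws.com/sample1.odt"  # URL from the question
response = requests.get(url)
response.raise_for_status()

textdoc = load(io.BytesIO(response.content))  # load the .odt entirely from memory
for para in textdoc.getElementsByType(text.P):
    print(teletype.extractText(para))

For legacy .doc files, odfpy will not help; a separate parser (for example, textract or antiword) would be needed, which is outside the scope of this snippet.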

Django not recognizing or seeing JSON file

I've been working on integrating Google Sheets with Django; I'm trying to use gspread. I can see the data using python filename.py, but when I run python manage.py runserver, I keep getting this error:
IOError: [Errno 2] No such file or directory: 'key.json'
It's not recognizing or seeing my JSON file for some reason; I've also tried using 'key' without the .json, no luck. I've been googling; any ideas? Here's my code below.
*************************** code below *******************************
import gspread
import json
from oauth2client.service_account import ServiceAccountCredentials
import os
scope = ['https://spreadsheets.google.com/feeds']
credentials = ServiceAccountCredentials.from_json_keyfile_name('key.json', scope)
gc = gspread.authorize(credentials)
wks = gc.open("RAMP - Master").sheet1
print wks
cell_list = wks.range('A1:B7')
print cell_list
If key.json is in the same directory as the file you're running, then the correct syntax is:
import os

DIRNAME = os.path.dirname(__file__)

credentials = ServiceAccountCredentials.from_json_keyfile_name(
    os.path.join(DIRNAME, 'key.json'),
    scope
)
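A related option, sketched here under the assumption of a standard Django layout where settings.py defines BASE_DIR (the default in startproject), is to build the path from the project root so it does not depend on the directory manage.py runserver is started from:

import os
from django.conf import settings

# Assumes key.json sits in the project root next to manage.py and that
# settings.BASE_DIR is defined (as in Django's default settings template).
credentials = ServiceAccountCredentials.from_json_keyfile_name(
    os.path.join(settings.BASE_DIR, 'key.json'),
    scope
)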

How to open a remote file with GDAL in Python through a Flask application

So, I'm developing a Flask application which uses the GDAL library, where I want to stream a .tif file through a URL.
Right now I have a method that reads a .tif file using gdal.Open(filepath). When run outside of the Flask environment (for example, in a Python console), it works fine both when specifying the path to a local file and when specifying a URL.
from gdalconst import GA_ReadOnly
import gdal

filename = 'http://xxxxxxx.blob.core.windows.net/dsm/DSM_1km_6349_614.tif'
dataset = gdal.Open(filename, GA_ReadOnly)
if dataset is not None:
    print 'Driver: ', dataset.GetDriver().ShortName, '/', \
        dataset.GetDriver().LongName
However, when the same code is executed inside the Flask environment, I get the following message:
ERROR 4: `http://xxxxxxx.blob.core.windows.net/dsm/DSM_1km_6349_614.tif' does
not exist in the file system,
and is not recognised as a supported dataset name.
If I instead download the file to the local filesystem of the Flask app, and insert the path to the file, like this:
block_blob_service = get_blobservice() #Initialize block service
block_blob_service.get_blob_to_path('dsm', blobname, filename) # Get blob to local filesystem, path to file saved in filename
dataset = gdal.Open(filename, GA_ReadOnly)
That works just fine...
The thing is, since I'm requesting some big files (200 MB), I want to stream the files using the URL instead of the local file reference.
Does anyone have an idea of what could be causing this? I also tried putting "/vsicurl_streaming/" in front of the URL, as suggested elsewhere.
I'm using Python 2.7, 32-bit with GDAL 2.0.2
Please try the following code snippet:
from gzip import GzipFile
from io import BytesIO
import urllib2
from uuid import uuid4
from gdalconst import GA_ReadOnly
import gdal

def open_http_query(url):
    try:
        request = urllib2.Request(url,
                                  headers={"Accept-Encoding": "gzip"})
        response = urllib2.urlopen(request, timeout=30)
        if response.info().get('Content-Encoding') == 'gzip':
            return GzipFile(fileobj=BytesIO(response.read()))
        else:
            return response
    except urllib2.URLError:
        return None

url = 'http://xxx.blob.core.windows.net/container/example.tif'
image_data = open_http_query(url)

mmap_name = "/vsimem/" + uuid4().get_hex()
gdal.FileFromMemBuffer(mmap_name, image_data.read())
dataset = gdal.Open(mmap_name)
if dataset is not None:
    print 'Driver: ', dataset.GetDriver().ShortName, '/', \
        dataset.GetDriver().LongName
This uses a GDAL in-memory file (/vsimem/) to open an image retrieved via HTTP directly, without saving it to a temporary file; the resulting dataset can then be read into a NumPy array with ReadAsArray.
Refer to https://gist.github.com/jleinonen/5781308 for more info.
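One follow-up worth noting: a /vsimem/ buffer stays allocated until it is explicitly released, so in a long-running Flask process it is worth freeing it once the dataset is no longer needed, for example:

# After the raster has been processed, close the dataset and free the in-memory file.
dataset = None            # closes the GDAL dataset
gdal.Unlink(mmap_name)    # releases the /vsimem/ buffer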
