Internet access within AWS Glue job - python

Do Glue jobs have internet access?
Using this test job:
def have_internet():
conn = httplib.HTTPConnection("www.google.com", timeout=5)
try:
conn.request("HEAD", "/")
conn.close()
logger.warn('ok')
except:
conn.close()
logger.warn('no ok')
have_internet()
It appears they do not...
Also, within a properly configured Glue dev endpoint, there is no internet access.
By properly configured, I mean within a public subnet (internet gateway), with S3 endpoint and internet gateway, and a working 'connection', and security groups.
But still no internet access...
I want internet access to be able to interrogate an on prem database, save to S3, and run another job to transform, and load to rds...
Can I use glue for the extract?

This issue has resolved itself now, I suspect due to an update in Glue, or the associated infrastructure.
The connectivity issue was occuring from within the PySpark REPL, and not on the actual Dev Endpoint instance itself...
Anyway, for anyone else troubleshooting similar network connectivity issues with Glue, here is a list of possible causes:
Dev Endpoint needs to be in a 'public' subnet*
DHCP options need to have the default setting
Security groups, security groups, security groups
Subnet should be associated with an S3 Endpoint
...

Related

AWS Lambda to RDS PostgreSQL

Hello fellow AWS contributors, I’m currently working on a project to set up an example of connecting a Lambda function to our PostgreSQL database hosted on RDS. I tested my Python + SQL code locally (in VS code and DBeaver) and it works perfectly fine with including only basic credentials(host, dbname, username password). However, when I paste the code in Lambda function, it gave me all sorts of errors. I followed this template and modified my code to retrieve the credentials from secret manager instead.
I’m currently using boto3, psycopg2, and secret manager to get credentials and connect to the database.
List of errors I’m getting-
server closed the connection unexpectedly. This probably means the server terminated abnormally before or while processing the request
could not connect to server: Connection timed out. Is the server running on host “db endpoint” and accepting TCP/IP connections on port 5432?
FATAL: no pg_hba.conf entry for host “ip:xxx”, user "userXXX", database "dbXXX", SSL off
Things I tried -
RDS and Lambda are in the same VPC, same subnet, same security group.
IP address is included in the inbound rule
Lambda function is set to run up to 15 min, and it always stops before it even hits 15 min
I tried both database endpoint and database proxy endpoint, none of it works.
It doesn’t really make sense to me that when I run the code locally, I only need to provide the host, dbname, username, and password, that’s it, and I’m able to write all the queries and function I want. But when I throw the code in lambda function, it’s requiring all these secret manager, VPC security group, SSL, proxy, TCP/IP rules etc. Can someone explain why there is a requirement difference between running it locally and on lambda?
Finally, does anyone know what could be wrong in my setup? I'm happy to provide any information in related to this, any general direction to look into would be really helpful. Thanks!
Following the directions at the link below to build a specific psycopg2 package and also verifying the VPC subnets and security groups were configured correctly solved this issue for me.
I built a package for PostgreSQL 10.20 using psycopg2 v2.9.3 for Python 3.7.10 running on an Amazon Linux 2 AMI instance. The only change to the directions I had to make was to put the psycopg2 directory inside a python directory (i.e. "python/psycopg2/") before zipping it -- the import psycopg2 statement in the Lambda function failed until I did that.
https://kalyanv.com/2019/06/10/using-postgresql-with-python-on-aws-lambda.html
This the VPC scenario I'm using. The Lambda function is executing inside the Public Subnet and associated Security Group. Inbound rules for the Private Subnet Security Group only allow TCP connections to 5432 for the Public Subnet Security Group.
https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_VPC.Scenarios.html#USER_VPC.Scenario1

Unable to connect to aws redshift from python within lambda

I am trying to connect to redshift with python through lambda. The purpose is to perform queries on the redshift database.
I've tried this by getting the temp aws credentials and connecting with psycopg2, but it isn't successful without any error messages. (IE: the lambda just time out)
rs_host = "mytest-cluster.fooooooobaaarrrr.region111111.redshift.amazonaws.com"
rs_port = 5439
rs_dbname = "dev"
db_user = "barrr_user"
def lambda_handler(events, contx):
# The cluster_creds is able to be obtained successfully. No issses here
cluster_creds = client.get_cluster_credentials(DbUser=db_user,
DbName=rs_dbname,
ClusterIdentifier="mytest-cluster",
AutoCreate=False)
try:
# It is this psycopg2 connection that cant work...
conn = psycopg2.connect(host=rs_host,
port=rs_port,
user=cluster_creds['DbUser'],
password=cluster_creds['DbPassword'],
database=rs_dbname
)
return conn
except Exception as e:
print(e)
Also, the lambda execution role itself has these policies:
I am not sure why am I still not able to connect to redshift via python to perform queries.
I have also tried with the sqlalchemy libary but no luck there.
As what Johnathan Jacobson mentioned above. It was the security groups and network permissions that caused my problem.
You can maybe review the documentation at Create AWS Lambda Function to Connect Amazon Redshift with C-Sharp in Visual Studio
Since you have already your code in Python, you can concentrate on the networking part of the tutorial
While launching AWS Lambda functions, it is possible to select a VPC and subnet where the serverless lambda function servers will spinup
You can choose exactly the same VPC and the subnet(s) where you have created your Amazon Redshift cluster
Also, revise the IAM role you have attached to the AWS Lambda function. It requires additionally the AWSLambdaVPCAccessExecutionRole policy
This will be solving issues between connections from different VPCs
Again, even you have launched the lambda function in the same VPC and subnet with Redshift cluster, it is better to check the security group of the cluster so that it accepts connections
Hope it works,

Not able to connect with mongo from AWS lambda functionin python

I have aws lambda which is not able to connect with mongo through VPC.
import pymongo
def handler(event, context):
try:
client = pymongo.MongoClient(host="xxxxxxx", port=27017, username=x1, password=x2, authsource="x3, authMechanism='SCRAM-SHA-1')
except pymongo.errors.ServerSelectionTimeoutError as err:
print(err)
Not able to found the server.
I have created a security group and new roles have given VPC and lambda full access too but not able to connect.
Taken help from https://blog.shikisoft.com/access-mongodb-instance-from-aws-lambda-python/ as well as https://blog.shikisoft.com/running-aws-lambda-in-vpc-accessing-rds/ links.
Please be helpful.
Trying since yesterday but no luck.
Let me try to help you figured out where the problem is.
1. Are your MongoDB EC2 Instance and your Lambda hosted on the same VPC?
If this the cause of your problem, you should move your services into the same VPC.
2. Is your Security Group that attached to your MongoDB EC2 Instance
and your Lambda has whitelisted/include the default sg?
You have to include the default sg into your Security Group so, services/instances within that VPC can communicate.
3. Is your hostname publicly or privately accessed ?
If Lambda needs to connect over Internet to access your MongoDB instance, you don't need to attach your Lambda into a VPC.
Inside a VPC, Lambda requires a NAT Gateway to communicate to open
world. Try to communicate privately if your MongoDB instance and
Lambda are in the same VPC.
https://aws.amazon.com/premiumsupport/knowledge-center/internet-access-lambda-function/
Hope these answers are helpful to you.

AWS Lambda Function cannot access other services

I have a problem with an AWS Lambda Function which depends upon DynamoDB and SQS to function properly. When I try to run the lambda stack, they time out when trying to connect to the SQS service. The AWS Lambda Function lies inside a VPC with the following setup:
A VPC with four subnets
Two subsets are public, routing their 0.0.0.0/16 traffic to an internet gateway
A MySQL server sits in a public subnet
The other two contain the lambdas and route their 0.0.0.0/16 traffic to a NAT which lives in one of the public subnets.
All route tables have a 10.0.0.0/16 to local rule (is this the problem because Lambdas use private Ip's inside a VPC?)
The main rout table is the one with the NAT, but I explicitly associated the public nets with the internet gateway routing table
The lambdas and the mysql server share a security group which allows for inbound internal access (10.x/16) as well as unrestricted outbound traffic (0.0.0.0/16).
Traffic between lambdas and the mysql instance is no problem (except if I put the lambdas outside the VPC, then they can't access the server even if I open up all ports). Assume the code for the lambdas is also correct, as it worked before I tried to mask it in a private net. Also the lambda execution roles have been set accordingly (or do they need adjustments after moving them to a private net?).
Adding a dynamodb endpoint solved the problems with the database, but there are no VPC endpoints available for some of the other services. Following some answers I found here, here, here and in the announcements / tutorials here and here, I am pretty sure I followed all the recommended steps.
I would be very thankful and glad for any hints where to check next, as I have currently no idea what could be the problem here.
EDIT: The function don't seem to have any internet access at all, since a toy example I checked also timed out:
import urllib.request
def lambda_handler(event, context):
test = urllib.request.urlopen(url="http://www.google.de")
return test.status
Of course the problem was sitting in front of the monitor again. Instead of routing 0.0.0.0/0 (any traffic) to the internet gateway, I had just specified 0.0.0.0/16 (traffic from machines with an 0.0.x.x ip) to the gate. Since no machines with such ip exists any traffic was blocked from entering leaving the VPC.
#John Rotenstein: Thx, though for the hint about lambdash. It seems like a very helpful tool.
Your configuration sounds correct.
You should test the configuration to see whether you can access any public Internet sites, then test connecting to AWS.
You could either write a Lambda function that attempts such connections or you could use lambdash that effectively gives you a remote shell running on Lambda. This way, you can easily test connectivity from the command line, such as curl.

AWS Lambda sending HTTP request

This is likely a question with an easy answer, but i can't seem to figure it out.
Background: I have a python Lambda function to pick up changes in a DB, then using HTTP post the changes in json to a URL. I'm using urllib2 sort of like this:
# this runs inside a loop, in reality my error handling is much better
request = urllib2.Request(url)
request.add_header('Content-type', 'application/json')
try:
response = urllib2.urlopen(request, json_message)
except:
response = "Failed!"
It seems from the logs either the call to send the messages is skipped entirely, or times-out while waiting for a response.
Is there a permission setting I'm missing, the outbound rules in AWS appear to be right. [Edit] - The VPC applied to this lambda does have internet access, and the security groups applied appear to allow internet access. [/Edit]
I've tested the code locally (connected to the same data source) and it works flawlessly.
It appears the other questions related to posting from a lambda is related to node.js, and usually because the url is wrong. In this case, I'm using a requestb.in url, that i know is working as it works when running locally.
Edit:
I've setup my NAT gateway, and it should work, I've even gone as far as going to a different AWS account, re-creating the conditions, and it works fine. I can't see any Security Groups that would be blocking access anywhere. It's continuing to time-out.
Edit:
Turns out i was just an idiot when i setup my default route to the NAT Gateway, out of habit i wrote 0.0.0.0/24 instead of 0.0.0.0/0
If you've deployed your Lambda function inside your VPC, it does not obtain a public IP address, even if it's deployed into a subnet with a route to an Internet Gateway. It only obtains a private IP address, and thus can not communicate to the public Internet by itself.
To communicate to the public Internet, Lambda functions deployed inside your VPC need to be done so in a private subnet which has a route to either a NAT Gateway or a self-managed NAT instance.
I have also faced the same issue. I overcame it by using boto3 to invoke a lambda from another lambda.
import boto3
client = boto3.client('lambda')
response = client.invoke(
FunctionName='string',
InvocationType='Event'|'RequestResponse'|'DryRun',
LogType='None'|'Tail',
ClientContext='string',
Payload=b'bytes'|file,
Qualifier='string'
)
But make sure that you set the IAM policy for lambda role (in the Source AWS account) to invoke that another lambda.
Adding to the above, boto3 uses HTTP at the backend.

Categories

Resources