I currently have a .csv file in an S3 bucket that I'd like to append to a table in a Redshift database using a Python script. I have a separate file parser and upload to S3 that work just fine.
The code I have for connecting to and copying into the table is below. I get the following error message:
OperationalError: (psycopg2.OperationalError) could not connect to server: Connection timed out (0x0000274C/10060)
Is the server running on host "redshift_cluster_name.unique_here.region.redshift.amazonaws.com" (18.221.51.45) and accepting
TCP/IP connections on port 5439?
I can confirm the following:
Port is 5439
Not encrypted
Cluster name/DB name/username/password are all correct
Publicly accessible set to "Yes"
What should I be fixing to make sure I can connect my file in S3 to Redshift? Thank you all for any help you can provide.
I have also looked around on Stack Overflow and ServerFault, but the existing questions seem to be about MySQL to Redshift, or their solutions (like the linked ServerFault CIDR solution) did not work for me.
Thank you for any help!
import sqlalchemy as sa
from sqlalchemy.orm import sessionmaker

DATABASE = "db"
USER = "user"
PASSWORD = "password"
HOST = "redshift_cluster_name.unique_here.region.redshift.amazonaws.com"
PORT = "5439"
SCHEMA = "public"
S3_FULL_PATH = 's3://bucket/file.csv'
ARN_CREDENTIALS = 'arn:aws:iam::aws_id:role/myRedshiftRole'
REGION = 'region'
############ CONNECTING AND CREATING SESSIONS ############
connection_string = f"redshift+psycopg2://{USER}:{PASSWORD}@{HOST}:{PORT}/{DATABASE}"
engine = sa.create_engine(connection_string)
session = sessionmaker()
session.configure(bind=engine)
s = session()
SetPath = f"SET search_path TO {SCHEMA}"
s.execute(SetPath)
###########################################################
############ RUNNING COPY ############
copy_command = f'''
copy category from '{S3_FULL_PATH}'
credentials 'aws_iam_role={ARN_CREDENTIALS}'
delimiter ',' region '{REGION}';
'''
s.execute(copy_command)
s.commit()
######################################
#################CLOSE SESSION################
s.close()
##############################################
Connecting via a Python program would require the same connectivity as connecting from an SQL Client.
I created a new cluster so I could document the process for you.
Here's the steps I took:
Created a VPC with a CIDR of 10.0.0.0/16. I didn't really need to create another VPC, but I wanted to avoid any problems with prior configurations.
Created a Subnet in the VPC with CIDR of 10.0.0.0/24.
Created an Internet Gateway and attached it to the VPC.
Edited the default Route Table to send 0.0.0.0/0 traffic to the Internet Gateway. (I'm only creating a public subnet, so don't need a route table for private subnet.)
Created a Redshift Cluster Subnet Group with the single subnet I created.
Launched a 1-node Redshift cluster into the Cluster Subnet Group. Publicly accessible = Yes, default Security Group.
Went back to the VPC console to edit the Default Security Group. Added an Inbound rule for Redshift from Anywhere (a boto3 equivalent is sketched after these steps).
Waited for the Cluster to become ready.
I then used DbVisualizer to log in to the database. Success!
The above steps made a publicly-available Redshift cluster and I connected to it from my computer on the Internet.
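Here is that rough boto3 sketch of the inbound-rule step; the region and security group ID are placeholders, and in practice you would narrow the CIDR to your own address range rather than opening port 5439 to the world:
import boto3

ec2 = boto3.client("ec2", region_name="us-east-2")  # placeholder region

# Placeholder ID -- use the default security group of the cluster's VPC.
ec2.authorize_security_group_ingress(
    GroupId="sg-0123456789abcdef0",
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 5439,   # Redshift's default port
        "ToPort": 5439,
        "IpRanges": [{"CidrIp": "0.0.0.0/0", "Description": "Redshift inbound from anywhere"}],
    }],
)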
I've created an EC2 instance with AWS CDK in Python. I've added a security group and allowed ingress rules for IPv4 and IPv6 on port 22. The keypair that I specified, with the help of this stack question, has been used in other EC2 instances set up with the console with no issue.
Everything appears to be running, but my connection keeps timing out. I went through Amazon's checklist of what usually causes this, but none of those common things seem to be the problem (at least to me).
Why can't I connect with my SSH keypair to the instance I made with AWS CDK? I suspect the KeyName I am overriding is not the correct name in Python, but I can't find it in the CDK docs.
Code included below.
vpc = ec2.Vpc.from_lookup(self, "VPC", vpc_name=os.getenv("VPC_NAME"))
sec_group = ec2.SecurityGroup(self, "SG", vpc=vpc, allow_all_outbound=True)
sec_group.add_ingress_rule(ec2.Peer.any_ipv4(), connection=ec2.Port.tcp(22))
sec_group.add_ingress_rule(ec2.Peer.any_ipv6(), connection=ec2.Port.tcp(22))
instance = ec2.Instance(
    self,
    "name",
    vpc=vpc,
    instance_type=ec2.InstanceType.of(ec2.InstanceClass.T2, ec2.InstanceSize.MICRO),
    machine_image=ec2.AmazonLinuxImage(
        generation=ec2.AmazonLinuxGeneration.AMAZON_LINUX_2
    ),
    security_group=sec_group,
)
instance.instance.add_property_override("KeyName", os.getenv("KEYPAIR_NAME"))
elastic_ip = ec2.CfnEIP(self, "EIP", domain="vpc", instance_id=instance.instance_id)
This is an issue with internet reachability, not your SSH key.
By default, your instance is placed into a private subnet (docs), so it will not have inbound connectivity from the internet.
Place it into a public subnet and it should work.
Also, you don't have to use any overrides to set the key - use the built-in key_name argument. And you don't have to create the security group - use the connections abstraction. Here's the complete code:
vpc = ec2.Vpc.from_lookup(self, "VPC", vpc_name=os.getenv("VPC_NAME"))
instance = ec2.Instance(
    self,
    "name",
    vpc=vpc,
    instance_type=ec2.InstanceType.of(ec2.InstanceClass.T2, ec2.InstanceSize.MICRO),
    machine_image=ec2.AmazonLinuxImage(
        generation=ec2.AmazonLinuxGeneration.AMAZON_LINUX_2
    ),
    key_name=os.getenv("KEYPAIR_NAME"),
    vpc_subnets=ec2.SubnetSelection(subnet_type=ec2.SubnetType.PUBLIC),
)
instance.connections.allow_from_any_ipv4(ec2.Port.tcp(22))
elastic_ip = ec2.CfnEIP(self, "EIP", domain="vpc", instance_id=instance.instance_id)
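As a small, optional follow-up (not part of the original answer), you could also surface the Elastic IP as a stack output so the address to SSH to is printed after cdk deploy; this sketch assumes CDK v2, where CfnOutput lives in the top-level aws_cdk module:
from aws_cdk import CfnOutput

# For AWS::EC2::EIP, Ref returns the public IP address, so this output gives
# the address to point ssh at once the stack is deployed.
CfnOutput(self, "InstancePublicIp", value=elastic_ip.ref)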
I developed a simple Python Azure function app that uses pyodbc to select a few rows from an MS SQL server with a public IP. The function app runs fine on my laptop, but it doesn't work when I publish it to Azure cloud (I used the Consumption serverless plan, Linux environment). Through the logging, I know that it always gets stuck at the pyodbc.connect(...) command and times out.
#...
conn_str = f'Driver={driver};Server={server},{port};Database={database};Uid={user};Pwd={password};Encrypt=yes;TrustServerCertificate=no;Connection Timeout=30'
sql_query = f'SELECT * FROM {table_name}'

try:
    conn = pyodbc.connect(conn_str)  # always times out here if running on Azure cloud!!!
    logging.info(f'Inventory API - connected to {server}, {port}, {user}.')
except Exception as error:
    logging.info(f'Inventory API - connection error: {repr(error)}.')
else:
    with conn.cursor() as cursor:
        cursor.execute(sql_query)
        logging.info(f'Inventory API - executed query: {sql_query}.')
        data = []
        for row in cursor:
            data.append({'Sku': row.Sku, 'InventoryId': row.InventoryId, 'LocationId': row.LocationId, 'AvailableQuantity': row.AvailableQuantity})
#...
The logging captured:
Inventory API - connection error: OperationalError('HYT00', '[HYT00] [Microsoft][ODBC Driver 17 for SQL Server]Login timeout expired (0) (SQLDriverConnect)').
I already include pyodbc in the requirements.txt file. I also allow all outboundIpAddresses and possibleOutboundIpAddresses of my function app on my SQL server firewall. My function app does not have any network restrictions on Azure cloud (or at least that's what the network settings say).
My config file:
driver={ODBC Driver 17 for SQL Server}
server=I tried both the IP and the full internet host name; neither worked
Could someone give me a hint? Thanks.
Workarounds to fix the pyodbc connection error
Try using the format below:
import pyodbc
server = 'tcp:myserver.database.windows.net'
database = 'mydb'
username = 'myusername'
password = 'mypassword'
conn = pyodbc.connect('DRIVER={ODBC Driver 17 for SQL Server};SERVER='+server+';DATABASE='+database+';UID='+username+';PWD='+ password)
If you are using a domain name, add the tcp prefix: server = 'tcp:myserver.database.windows.net'; otherwise use the IP: server = '129.0.0.1'
If you need to specify a port, append it like this: 'tcp:myserver.database.windows.net,1233' or '129.0.0.1,1233'
Try removing additional properties like Connection Timeout and TrustServerCertificate, and check again.
Refer here
I put the following snippet into my function to check the outbound IP, and found out that Azure uses a few outbound IPs that are not listed in [outboundIpAddresses] or [possibleOutboundIpAddresses] (documented in this MS link):
import requests
#...
outbound_ip_response = requests.request('GET', 'https://checkip.amazonaws.com')
logging.info(f'Inventory API - main()- outbound ip = {outbound_ip_response.text}')
So, I followed the instructions in this link to set up a static outbound IP for my function app and allowed this IP to access my SQL server. It worked.
I have a Django app below that uses a proxy to connect to an external Postgres database. I had to replace another package with psycopg2, and it works fine locally, but it doesn't work when I move it onto our production server, which is a Heroku app using QuotaguardStatic for proxy purposes. I'm not sure what's wrong here.
For some reason, the psycopg2.connect part returns an error with a different IP address. Is it not inheriting the proxy set in the context manager? What would be the right way to make the psycopg2 connection go through the proxy?
import logging
import os

import psycopg2
import requests
from psycopg2.extras import RealDictCursor

from apps.proxy.socks import Socks5Proxy

logger = logging.getLogger(__name__)

PROXY_URL = os.environ['QUOTAGUARDSTATIC_URL']

with Socks5Proxy(url=PROXY_URL) as p:
    public_ip = requests.get("http://wtfismyip.com/text").text
    print(public_ip)  # prints the expected IP address
    print('end')
    try:
        connection = psycopg2.connect(user=EXTERNAL_DB_USERNAME,
                                      password=EXTERNAL_DB_PASSWORD,
                                      host=EXTERNAL_DB_HOSTNAME,
                                      port=EXTERNAL_DB_PORT,
                                      database=EXTERNAL_DB_DATABASE,
                                      cursor_factory=RealDictCursor  # To access query results like a dictionary
                                      )  # , ssl_context=True
    except psycopg2.DatabaseError as e:
        logger.error('Unable to connect to Illuminate database')
        raise e
Error is:
psycopg2.OperationalError: FATAL: no pg_hba.conf entry for host "12.345.678.910", user "username", database "databasename", SSL on
Basically, the IP address 12.345.678.910 does not match what was printed at the beginning of the context manager where the proxy is set. Do I need to set the proxy in another way so that the psycopg2 connection uses it?
I am facing a problem making my Apache Beam pipeline work on Cloud Dataflow with DataflowRunner.
The first step of the pipeline is to connect to an external Postgresql server hosted on a VM which is only externally accessible through SSH, port 22, and extract some data. I can't change these firewall rules, so I can only connect to the DB server via SSH tunneling, aka port forwarding.
In my code I make use of the python library sshtunnel. It works perfectly when the pipeline is launched from my development computer with DirectRunner:
import apache_beam as beam
from apache_beam.io import WriteToText
from sshtunnel import open_tunnel

# ReadFromSQL, DictToCSV and PostgresWrapper come from the pysql_beam library
# mentioned later in the thread.

with open_tunnel(
        (user_options.ssh_tunnel_host, user_options.ssh_tunnel_port),
        ssh_username=user_options.ssh_tunnel_user,
        ssh_password=user_options.ssh_tunnel_password,
        remote_bind_address=(user_options.dbhost, user_options.dbport)
) as tunnel:
    with beam.Pipeline(options=pipeline_options) as p:
        (p | "Read data" >> ReadFromSQL(
                host=tunnel.local_bind_host,
                port=tunnel.local_bind_port,
                username=user_options.dbusername,
                password=user_options.dbpassword,
                database=user_options.dbname,
                wrapper=PostgresWrapper,
                query=select_query
            )
           | "Format CSV" >> DictToCSV(headers)
           | "Write CSV" >> WriteToText(user_options.export_location)
        )
The same code, launched with DataflowRunner inside a non-default VPC (all ingress denied, no egress restrictions, Cloud NAT configured), fails with this message:
psycopg2.OperationalError: could not connect to server: Connection refused Is the server running on host "0.0.0.0" and accepting TCP/IP connections on port 41697? [while running 'Read data/Read']
So, obviously something is wrong with my tunnel but I cannot spot what exactly. I was beginning to wonder whether a direct SSH tunnel setup was even possible through CloudNAT, until I found this blog post: https://cloud.google.com/blog/products/gcp/guide-to-common-cloud-dataflow-use-case-patterns-part-1 stating:
A core strength of Cloud Dataflow is that you can call external services for data enrichment. For example, you can call a micro service to get additional data for an element.
Within a DoFn, call-out to the service (usually done via HTTP). You have full control to make any type of connection that you choose, so long as the firewall rules you set up within your project/network allow it.
So it should be possible to set up this tunnel! I don't want to give up, but I don't know what to try next. Any ideas?
Thanks for reading
Problem solved! I can't believe I spent two full days on this... I was looking completely in the wrong direction.
The issue was not with some Dataflow or GCP networking configuration, and as far as I can tell...
You have full control to make any type of connection that you choose, so long as the firewall rules you set up within your project/network allow it
is true.
The problem was of course in my code; it was only revealed in a distributed environment. I had made the mistake of opening the tunnel from the main pipeline processor, instead of from the workers. So the SSH tunnel was up, but not between the workers and the target server, only between the main pipeline and the target!
To fix this, I had to change my requesting DoFn to wrap the query execution with the tunnel:
class TunnelledSQLSourceDoFn(sql.SQLSourceDoFn):
    """Wraps SQLSourceDoFn in an SSH tunnel"""

    def __init__(self, *args, **kwargs):
        self.dbport = kwargs["port"]
        self.dbhost = kwargs["host"]
        self.args = args
        self.kwargs = kwargs
        super().__init__(*args, **kwargs)

    def process(self, query, *args, **kwargs):
        # Remote side of the SSH tunnel
        remote_address = (self.dbhost, self.dbport)
        ssh_tunnel = (self.kwargs['ssh_host'], self.kwargs['ssh_port'])
        with open_tunnel(
            ssh_tunnel,
            ssh_username=self.kwargs["ssh_user"],
            ssh_password=self.kwargs["ssh_password"],
            remote_bind_address=remote_address,
            set_keepalive=10.0
        ) as tunnel:
            forwarded_port = tunnel.local_bind_port
            self.kwargs["port"] = forwarded_port
            source = sql.SQLSource(*self.args, **self.kwargs)
            sql.SQLSouceInput._build_value(source, source.runtime_params)
            logging.info("Processing - {}".format(query))
            for records, schema in source.client.read(query):
                for row in records:
                    yield source.client.row_as_dict(row, schema)
As you can see, I had to override some bits of the pysql_beam library.
Finally, each worker opens its own tunnel for each request. It's probably possible to optimize this behavior, but it's enough for my needs.
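For completeness, here is a hypothetical sketch of how the read step might be wired to use this DoFn, replacing the original ReadFromSQL transform with a plain ParDo over the query; the keyword arguments are assumptions inferred from the code above, not pysql_beam's documented API:
with beam.Pipeline(options=pipeline_options) as p:
    (p | "Query" >> beam.Create([select_query])
       | "Read data" >> beam.ParDo(TunnelledSQLSourceDoFn(
             host=user_options.dbhost,               # DB host/port as seen from the SSH server
             port=user_options.dbport,
             username=user_options.dbusername,
             password=user_options.dbpassword,
             database=user_options.dbname,
             wrapper=PostgresWrapper,
             ssh_host=user_options.ssh_tunnel_host,  # consumed by the DoFn above
             ssh_port=user_options.ssh_tunnel_port,
             ssh_user=user_options.ssh_tunnel_user,
             ssh_password=user_options.ssh_tunnel_password))
       | "Format CSV" >> DictToCSV(headers)
       | "Write CSV" >> WriteToText(user_options.export_location)
    )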
I need to connect a Python script to Google Cloud SQL.
I have created the database in Google Cloud SQL and now want to connect to it from the Python script.
This is what I have tried so far:
import pymysql

conn = pymysql.connect(host='ip.ip.ipi.ip', user='user', passwd="passwd", db='mysql',
                       charset='utf8')
cur = conn.cursor()
cur.execute("USE Example_Database")
but getting this error:
OperationalError: (2003, "Can't connect to MySQL server on '35.246.235.61' (timed out)")
I also tried this, but it's not working either:
from google.cloud.sql.connector import connector

connector.connect(
    "instance:instancename",
    "mysql-connector",
    host='ip.ip.ipi.ip',
    user="user",
    password="passwd",
    db="Database"
)
It seems there is a connectivity issue to Cloud SQL from where you run the code.
Where are you running your Python code?
Can you check whether it can connect to your Google Cloud SQL endpoint on port 3306 (or the custom port you are using)?
telnet <Google_SQL_IP> 3306
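If telnet is not available where the code runs, a quick Python equivalent of that reachability check could look like this (the address is a placeholder):
import socket

try:
    # Replace with your Cloud SQL instance's public IP and port.
    socket.create_connection(("<Google_SQL_IP>", 3306), timeout=5).close()
    print("TCP connection succeeded")
except OSError as exc:
    print(f"TCP connection failed: {exc}")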
Depending on where your Python code is hosted, you need to allow the necessary access from your app server to Cloud SQL.
You can connect to a Cloud SQL instance using the following methods:
By using the proxy (Second Generation instances only)
By configuring access for one or more public IP addresses
By using the JDBC Socket Factory (for the Java programming language, Second Generation instances only)
By using the Cloud SQL Proxy library (for the Go programming language, Second Generation instances only)
Please review the following link which describes options for you to connect:
https://cloud.google.com/sql/docs/mysql/connect-external-app
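As a hedged sketch in the same spirit as the proxy option, the current cloud-sql-python-connector package exposes a Connector class that handles the authorized, encrypted connection for you (it needs the pymysql driver installed and Application Default Credentials with the Cloud SQL Client role); the instance connection name and credentials below are placeholders:
from google.cloud.sql.connector import Connector

connector = Connector()

# "project:region:instance" is the instance connection name shown on the
# Cloud SQL instance's overview page; user, password and db are placeholders.
conn = connector.connect(
    "project:region:instance",
    "pymysql",
    user="user",
    password="passwd",
    db="Example_Database",
)

with conn.cursor() as cur:
    cur.execute("SELECT NOW()")
    print(cur.fetchone())

conn.close()
connector.close()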