URLFetch behind a Proxy Server on App Engine Production

Is there a way to specify a proxy server when using urlfetch on Google App Engine?
Specifically, every time I make a call using urlfetch, I want GAE to go through a proxy server. I want to do this on production, not just dev.
I want to use a proxy because there are problems with using Google's outbound IP addresses (rate limiting, no static outbound IP, sometimes blacklisted, etc.). Setting a proxy is normally easy if you can edit the HTTP message itself, but GAE's API does not appear to let you do this.

You can always roll your own:
For a fixed destination: set up fixed port forwarding on a proxy server, then send requests from GAE to the proxy. If you have multiple destinations, forward a separate port for each one.
For dynamic destinations (too many to handle with fixed port forwarding): your GAE app adds a custom HTTP header (X-Something) containing the final destination and then connects to a custom proxy. The proxy inspects this header and forwards the request to that destination.
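The dynamic-destination variant could be sketched roughly as follows. This is only an illustration under stated assumptions: the proxy URL, the header name X-Forward-To, and both helper functions are hypothetical, and it is written for Python 3 rather than the Python 2 runtime that urlfetch shipped with.

```python
from urllib.parse import urlsplit

# Hypothetical values: replace with your own proxy endpoint and header name.
PROXY_URL = "https://proxy.example.com/forward"
DEST_HEADER = "X-Forward-To"


def build_proxied_request(destination_url, headers=None):
    """GAE side: return (url, headers) that target the proxy instead of the destination."""
    parts = urlsplit(destination_url)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError("destination must be an absolute http(s) URL")
    out = dict(headers or {})
    out[DEST_HEADER] = destination_url  # the proxy reads the real target from here
    return PROXY_URL, out


def extract_destination(headers, allowed_hosts):
    """Proxy side: read the destination header and refuse hosts that are not allowed."""
    destination = headers.get(DEST_HEADER)
    if destination is None:
        raise ValueError("missing destination header")
    host = urlsplit(destination).hostname
    if host not in allowed_hosts:
        raise ValueError("destination host not allowed: %s" % host)
    return destination
```

On the GAE side, the returned pair would be passed along as urlfetch.fetch(url, headers=headers). The allow-list on the proxy side is there so the proxy cannot be used as an open relay.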

We ran into this issue and reached out to Google Cloud support. They suggested we use Google App Engine flexible with some app.yaml settings, custom network, and an ip-forwarding NAT gateway instance.
This didn't work for us because many core features from App Engine Standard are not implemented in App Engine Flexible. In essence we would need to rewrite our product.
So, to make applicable URL fetch requests appear to have a static IP we made a custom proxy: https://github.com/csgactuarial/app-engine-proxy
For redundancy, I suggest implementing this as a multi-region, multi-zone, load-balanced system.

Related

Eliminating nuisance Instance starts

My GCP app has been abused by some users. To stop their usage I have attempted to eliminate features that can be abused, and have employed firewall rules to block certain users. But bad users continue to try to access my app via certain legacy URLs such as myapp.appspot.com/badroute. Of course, I still want users to use the default URL myapp.appspot.com .
I have altered my code in the following manner, but I am still getting Instances to start from them, and I do not want Instances in such cases. What can I do differently to avoid the bad Instances starting OR is there anything I can do to force such Instances to stop quickly instead of after about 15 minutes?
class Dummy(webapp2.RequestHandler):
    def get(self):
        logging.info("Dummy: ")
        self.redirect("/")

app = webapp2.WSGIApplication(
    [('/', MainPage),
     ('/badroute', Dummy)], debug=True)
(I may be referring to Instances when I should be referring to Requests.)

So what's the objective? Do you want users who visit /badroute to be redirected to some /goodroute, or do you want /badroute to not hit GAE and incur cost?
Putting a Google Cloud load balancer in front could help.
For the first case you could set up a redirect rule (although you can do this directly within App Engine too, as you did in your code example).
If you just want it to not hit App Engine, you could set up the load balancer to route /badroute to a file in a GCS bucket instead of your GAE service:
https://cloud.google.com/load-balancing/docs/https/ext-load-balancer-backend-buckets
However, you wouldn't be able to use your *.appspot.com base URL. You'd get a static IP, which you should then map a custom domain to.

DISCLAIMER: I'm not 100% sure if this would work.
Create a new service dummy.
Create and deploy a dispatch.yaml (GAE Standard // GAE Flex)
Add the links you want to block to the dispatch.yaml and point them to the dummy service.
Set up the Identity Aware Proxy (IAP) and enable it for the dummy service.
???
Profit
The idea is that the IAP will block the requests before they hit the dummy service. Since the requests never actually get forwarded to the service dummy you will not have an instance start. The bots will get a nice 403 page from Google's own infrastructure instead.
EDIT: Be sure to create the dummy service with 0 instances as the idea is to not have it cost money.
EDIT2:
So let me expand a bit on this answer.
You can have multiple GAE services running within one GCP project. Each service is its own app. You can have one service running a Python Flask app and another running a Java Spring Boot app. Each can be either GAE Standard or GAE Flex. See this doc.
Normally all traffic gets routed to the default service. Using dispatch.yaml you can make requests to certain endpoints go to a specific service.
If you create the dummy service as a GAE Standard app, and you don't actually need it to do anything, you can then route all the endpoints that get abused to this dummy service using the dispatch.yaml. Using GAE Standard you can have the service use 0 instances (and 0 costs).
Using the IAP you can then make sure only your own Google account can access this app (which you won't do). In effect this means that the abusers cannot really access the service, as the IAP blocks it before hitting the service, as you've set it up so only your Google account can access it.
Note, the dispatch.yaml is separate from any services, it's one of the per-project configuration files for GAE. It's not tied to a specific service.
As stated, the dummy app doesn't actually need to do anything, but you do need to deploy it once, as this is what creates the service.

Consider using Cloudflare to mitigate bot abuse, customize firewall rules for route access, rate-limit IPs, etc. This can be combined with a Google Cloud load balancer if you'd like, as mentioned in https://stackoverflow.com/a/69165767/806876.
References
Cloudflare GCP integration: https://www.cloudflare.com/integrations/google-cloud/

There is a little information I did not provide in my question about my app.yaml:
handlers:
- url: /.*
  script: mainapp.app
By simply removing .* from the url specification, no Instance start is created. The user gets Error: Not Found, though. So that satisfies my needs.
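To see why the change helps, here is a rough Python model of the routing decision. It assumes that app.yaml url patterns behave like regular expressions that must match the whole request path; handler_matches is a hypothetical helper for illustration, not an App Engine API.

```python
import re


def handler_matches(url_pattern, request_path):
    """Return True if the pattern matches the entire request path (assumed semantics)."""
    return re.fullmatch(url_pattern, request_path) is not None


# Old handler "/.*": every path, including /badroute, is routed to the script.
old_routes_badroute = handler_matches("/.*", "/badroute")

# New handler "/": only the root path is routed, so /badroute never reaches the app.
new_routes_badroute = handler_matches("/", "/badroute")
new_routes_root = handler_matches("/", "/")
```

Under that assumption, a request for /badroute no longer matches any handler, which is consistent with the Not Found error and the absence of an Instance start.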
Edo Akse's Answer pushed me to this answer by reading here, so I am accepting his answer. I am still not clear how to implement Edo's Answer, though.

Connect Cloud Function and App Engine internally

I have an architecture where a Cloud Function is triggered whenever a file is uploaded to a bucket and sends the task to an API built on Flask and deployed on App Engine.
I want to make this process internal so that only the Cloud Function can access the App Engine endpoints, but I am struggling with the process.
As these two services are serverless, I can't just filter the traffic in the App Engine firewall since the Cloud Function will have a different IP each time a new instance is created.
I have tried to follow this guide, which recommends routing all of the function's egress traffic through a Serverless VPC Connector assigned to a subnet, and then sending all of that subnet's traffic through a NAT with a static IP address. This way I could filter on my App Engine firewall by the NAT IP, which will always be the same.
After following all the steps, I am still not able to filter the traffic. With this configuration in place, if I open the traffic to everyone and print the IP route given by the X-Forwarded-For header that App Engine receives when I send a simple GET request from the Cloud Function, it returns 0.0.0.0, 169.254.1.1 (it is a list, since this header records the client IP and the proxies involved in the route). The static IP address assigned to my NAT is 34.78.XX.XX, so it seems that the function is not using the NAT to route the traffic.
I have read somewhere that when the destination IP is hosted on Google Cloud, the traffic will not go through the NAT gateway, so maybe this solution won't work for my use case.
Any idea what I am doing wrong, or whether there are any alternatives for making the process private?
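The X-Forwarded-For check described in the question could be scripted along these lines. This is a small sketch; the function names are hypothetical, and since the real NAT address is elided above, the example uses a placeholder from the documentation range.

```python
def forwarded_chain(header_value):
    """Split an X-Forwarded-For value into its list of IP strings."""
    return [part.strip() for part in header_value.split(",") if part.strip()]


def went_through_nat(header_value, nat_ip):
    """Return True if the NAT's static IP appears anywhere in the chain."""
    return nat_ip in forwarded_chain(header_value)


# Placeholder address (documentation range), standing in for the real NAT IP.
NAT_IP = "203.0.113.10"

# The value observed in the question contains no trace of a NAT address.
observed = "0.0.0.0, 169.254.1.1"
nat_was_used = went_through_nat(observed, NAT_IP)
```

Logging the result of such a check in the Flask handler makes it easy to confirm whether a configuration change actually altered the route.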

There are two ways to solve this problem, and the choice depends on what you believe in!
Network based solution
If you want to keep your App Engine service internal only, from a network point of view, you can set the ingress control to internal-only so that it accepts only traffic coming from the VPC of your project.
From there, you need to deploy your Cloud Function with a VPC connector (to route the traffic to the VPC) and set the egress control to All, so that all traffic, public and private, is routed to the VPC.
Indeed, even if you set your App Engine service to internal ingress mode, the service is still publicly accessible, but there is an additional check on the request origin to make sure it comes from one of the project's VPCs. Therefore, when you call App Engine from your Cloud Function, you call a public endpoint, and you need to route that public traffic through your VPC for it to be accepted by App Engine's internal-only ingress.
This solution works only with a VPC in the same project. A cross-project setup is impossible.
Identity based solution
Google says: Don't trust the network
So, based on that, Google trusts the identity behind the traffic and the request. You can keep your service private (not accessible by anyone without authorized access) purely by controlling the authentication of the connection.
For that, you need to activate IAP on your App Engine service and authorize only the service account of your Cloud Function.
Then, in your Cloud Function, you need to generate an identity token manually and add it to the header of your request.
Be careful, there is a trap here: the audience is the IAP Client ID (which you can find on the APIs & Services -> Credentials page).
Only the valid requests, checked by IAP, will trigger your App Engine service. In case of attacks, IAP will absorb the bad traffic, not App Engine.
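The token step might look roughly like this inside the Cloud Function. It is a sketch assuming the google-auth package (with its requests transport) is installed; the helper names and the way the request is built are illustrative, and the IAP Client ID and URL must be supplied by you.

```python
import urllib.request


def fetch_iap_token(iap_client_id):
    """Fetch an identity token whose audience is the IAP Client ID."""
    # Imported lazily so this module still loads where google-auth is absent.
    import google.auth.transport.requests
    import google.oauth2.id_token

    auth_request = google.auth.transport.requests.Request()
    # On Cloud Functions the token is issued for the function's service account.
    return google.oauth2.id_token.fetch_id_token(auth_request, iap_client_id)


def build_iap_request(url, token, data=None):
    """Build a request that carries the identity token in the Authorization header."""
    return urllib.request.Request(
        url, data=data, headers={"Authorization": "Bearer " + token}
    )
```

A call would then be something like urllib.request.urlopen(build_iap_request(url, fetch_iap_token(client_id))). Passing the App Engine URL as the audience instead of the IAP Client ID is the trap mentioned above.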
So now, what do you trust?

What firewall rules and instance specs are needed to run a Flask app on google compute engine?

I am trying to deploy a Flask web app on google compute engine and am wondering:
What is the best instance type to use, is a g1-small sufficient?
What network traffic do I allow for the instance, HTTP and HTTPS or just one of them?
What port do I allow for the instance? I saw some people mentioned using tcp 5000.
Any other tips on instance or firewall specs would be much appreciated!

What is the best instance type to use, is a g1-small sufficient?
The answer depends on the traffic workload for your instance. Start with micro or small, monitor response time and adjust instance size to match the load.
What network traffic do I allow for the instance, HTTP and HTTPS or just one of them?
That depends on what traffic/data you are serving. As a general rule, there is no reason to not implement HTTPS (SSL certificates) today.
What port do I allow for the instance? I saw some people mentioned using tcp 5000.
You should not be serving using Flask's development server, which defaults to port 5000. Instead use a production server. You'll need to open whatever port you configure your server to listen on.

IP address as the hostname for AppHarbor / App Engine urlfetch to cache DNS lookups

I have an AppHarbor app that I'm using as an external service which will get requested by my other servers which use Google App Engine (python). The appharbor app is basically getting pinged a lot to process some data that I send it.
Because I'll be constantly pinging the service, and time is important, is it possible to reference my appharbor app through its IP address and not the hostname? Basically I want to eliminate having to do DNS lookups and speed up the response.
I'm using Google App Engine's urlfetch (https://developers.google.com/appengine/docs/python/urlfetch/overview) to do the request. Is caching the ip address something urlfetch is already doing under the covers? If not, is it possible to do so?

I doubt that DNS lookups will be your bottleneck, but anyway as far as I can see DNS lookups are cached by the system (for at least the TTL).

You can theoretically send requests directly to an IP address, but you would have to also pass the host header so that the AppHarbor routing layer can figure out what application gets the request.
As Shay mentions, you shouldn't do this though - DNS queries are cached and are not likely to be a bottleneck and you're setting yourself up for breakage because the IP address might change with the domain being pointed to a new IP.
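For completeness, addressing the service by IP while still naming the application would look something like the sketch below. The IP address and hostname in the usage line are placeholders, request_by_ip is a hypothetical helper, and it uses plain HTTP because an HTTPS certificate would not validate against a bare IP.

```python
import urllib.request


def request_by_ip(ip_address, hostname, path="/"):
    """Build a request addressed to an IP, with the Host header naming the app."""
    url = "http://%s%s" % (ip_address, path)
    # Without this header the routing layer cannot tell which app is meant.
    return urllib.request.Request(url, headers={"Host": hostname})


# Placeholder values for illustration only.
example = request_by_ip("203.0.113.5", "myapp.example.com", "/process")
```

The same idea applies to urlfetch by passing the Host header in its headers argument, with the caveat already given: the request breaks as soon as the IP behind the hostname changes.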

Sign up for the AppEngine Sockets Trusted Tester (here) and use the normal python:
socket.gethostbyname(...)
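If you do resolve the name yourself, a small in-memory cache avoids repeating the lookup on every request. The sketch below is an assumption-laden illustration: DnsCache is a hypothetical class, the TTL is an arbitrary number rather than the DNS record's real TTL, and the cache lives only as long as the instance does.

```python
import socket
import time


class DnsCache:
    """Cache hostname lookups in memory for a fixed number of seconds."""

    def __init__(self, ttl_seconds=300.0, resolver=socket.gethostbyname,
                 clock=time.monotonic):
        self._ttl = ttl_seconds
        self._resolver = resolver
        self._clock = clock
        self._entries = {}  # hostname -> (ip, expiry time)

    def lookup(self, hostname):
        now = self._clock()
        cached = self._entries.get(hostname)
        if cached is not None and cached[1] > now:
            return cached[0]  # still fresh, skip the DNS query
        ip = self._resolver(hostname)
        self._entries[hostname] = (ip, now + self._ttl)
        return ip
```

Keeping the TTL short limits how long a stale address is used after the hostname is pointed at a new IP.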

Outbound FTP request from google appengine using python

I need to make an outbound ftp request to retrieve a number of small
files. There are 6 files each less than 10K and I only need to
retrieve them once every couple of hours.
When I try to do this with urllib2.urlopen("ftp://xxx.xxx.xxx") I get
an exception AttributeError: 'module' object has no attribute
'FTP_PORT'.
I have read through the documentation and see you are only allowed to
make http and https requests from the appengine, unfortunately my
application needs to consume the ftp data, does this requirement mean
I can't use the appengine at all ? I sincerely hope not.
So has anyone else here found a way to make ftp requests, perhaps with
a paid account ? And if not what have other people chosen to do ?
does azure or ec2 allow outbound ftp requests ?

You're correct. Google App Engine does not allow you to make FTP requests. Not even with a paid account.
I had to use a LAMP instance on EC2 that handles FTP'ing through CURL, and make http requests to it from GAE.

This limitation used to drive me nuts; implementing the overhead around dynamically instantiating EC2 slave workers to relay FTP data felt like a real waste of time. Fortunately, as of April 9 this year (SDK 1.7.7) this isn't a problem any longer. Outbound sockets (e.g. FTP) are generally available to all billing-enabled apps.
Sockets API Overview (Python): https://developers.google.com/appengine/docs/python/sockets/
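With outbound sockets available, retrieving a handful of small files could be done with the standard ftplib module, roughly as below. This is a Python 3 sketch; the function names are hypothetical, and whether ftplib works unchanged on a given App Engine runtime is an assumption to verify.

```python
import ftplib
import io


def download_files(ftp, filenames):
    """Download each file over an open FTP connection and return {name: bytes}."""
    results = {}
    for name in filenames:
        buffer = io.BytesIO()
        # retrbinary calls buffer.write once per received chunk.
        ftp.retrbinary("RETR " + name, buffer.write)
        results[name] = buffer.getvalue()
    return results


def fetch_from_server(host, user, password, filenames):
    """Connect, log in, download the files, and close the connection."""
    with ftplib.FTP(host) as ftp:
        ftp.login(user, password)
        return download_files(ftp, filenames)
```

For six files under 10K each, holding the contents in memory like this is fine; a cron-triggered handler could call fetch_from_server every couple of hours.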

drivehq.com is another option. It provides both a web server and an FTP server, so a third party I needed to interface with (one that spoke only FTP) would upload files via FTP, and then I would urlfetch them on App Engine.
