I am new to AWS EC2, so I am making this post to ask a few questions.
1) Right now, I am considering running some scripts on the server. I usually use two tools. One is a piece of software that only runs on Windows; the other is just Python. Should I launch two instances, one for Windows and one for Ubuntu, or just one Windows instance with Git Bash installed? I want to be cost and performance efficient.
2) I am not going to use the scripts very often (usually 2-3 hours per day, or 10-12 hours per week). Is it easy to schedule those jobs automatically across the instances? I mean, can an instance automatically stop and restart at appropriate times?
3) Some of the scripts involve web scraping. I am also wondering whether it is OK to get a different IP address every time I run the script. This mainly concerns the Python script.
Thanks.
1) Well, of course, the fewer instances you have, the less you will pay. Python can run on Windows; I just don't know how tricky it would be to make it work in your case. It all depends on what you are running and what your management requirements are. These scripting languages were originally designed for Unix environments, so people usually run them on those kinds of systems, and running them on Windows may be a little unpleasant. Anyway, I don't think you should ask someone else; you should figure out for yourself what suits you best.
2) AWS doesn't have a scheduler for EC2 (stopping, starting, etc., at given dates/times/recurrences). It's something that I miss too. So, to achieve something like this, you have a few options.
Turn your temporary instance into an auto-scaling group of 1 instance, and use scheduled policies to scale it in to zero instances and back out to 1 instance when you want. The problem with this approach is that if you can't be sure how long your job will take to complete, then you have a problem, of course, because those scheduled actions are based on fixed dates/times. One solution would be for the temporary instance itself to change the auto-scaling group configuration to zero instances via the API when it has finished. (In this case, you would only have a scale-out scheduled policy to launch the instance, leaving its termination to be done 'manually', by changing the auto-scaling group configuration from inside the temporary instance.) But be aware that auto-scaling is very tricky for beginners, and you should go through the documentation before using it. (For example, each time your instances scale in and out, they are terminated, not just stopped, and you lose all the data on them.)
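For illustration, here is a minimal boto3 sketch of that pattern. The group name and schedule are made-up placeholders, and the second call is what the temporary instance would run on itself when the job finishes:

```python
import boto3

autoscaling = boto3.client("autoscaling")  # assumes credentials/region are configured

# Scheduled scale-out: bring the group to 1 instance every weekday at 08:00 UTC.
autoscaling.put_scheduled_update_group_action(
    AutoScalingGroupName="temp-job-asg",      # hypothetical group name
    ScheduledActionName="start-temp-instance",
    Recurrence="0 8 * * 1-5",                 # cron-style schedule (UTC)
    MinSize=1, MaxSize=1, DesiredCapacity=1,
)

# Scale-in, run from inside the temporary instance once the job has finished.
autoscaling.update_auto_scaling_group(
    AutoScalingGroupName="temp-job-asg",
    MinSize=0, MaxSize=1, DesiredCapacity=0,
)
```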
Alternatively, don't use an auto-scaling group: keep a regular instance and schedule all those actions from outside it via the API. This could be done from your Windows (master) instance. In this case, the master would start the temporary instance via the API; the temporary instance would run its jobs and then shut itself down when finished. Otherwise, the master instance would have to keep polling the temporary one somehow to know when the jobs are done so it can be shut down from outside.
There are probably more complicated ways for doing this (Elastic Beanstalk crons, maybe).
I think, in this case, the simpler, the better, so I would stick with the second option. You will only need to figure out how to install and use the AWS CLI on Windows and manage IAM credentials and permissions so that it has enough access to do what it needs.
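As a rough sketch, the start/stop calls from that option look like this in boto3, the AWS SDK for Python (the CLI equivalents are `aws ec2 start-instances` / `aws ec2 stop-instances`); the instance ID is a placeholder:

```python
import boto3

ec2 = boto3.client("ec2")                 # assumes IAM credentials with ec2:Start/StopInstances
TEMP_INSTANCE_ID = "i-0123456789abcdef0"  # hypothetical instance ID

# Run from the Windows (master) instance to boot the temporary worker.
ec2.start_instances(InstanceIds=[TEMP_INSTANCE_ID])

# The temporary instance can shut itself down when the job is done,
# or the master can poll and stop it from outside:
ec2.stop_instances(InstanceIds=[TEMP_INSTANCE_ID])
```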
3) If you don't assign an Elastic IP to your instance, you will get a different public IP each time you stop and start it, so this is, by default, what you want. With auto-scaling, this is the only way; you can't even assign a fixed IP to the instances.
I hope I could help you a little bit.
Background
I have some DAGs that pull data from a third-party API.
The accounts we need to pull can change over time. To determine which accounts to pull, we may need to query a database or make an HTTP request, depending on the process.
Before Airflow, we would just get the account list at the start of the Python script, then iterate through it and pull each account to a file, or whatever it was we needed to do.
But now, using Airflow, it makes sense to define tasks at the account level and let Airflow handle retries, date ranges, parallel execution, etc.
Thus my DAG might look something like this:
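(Illustrative sketch only; `get_account_list()` and `pull_account()` stand in for whatever query and extraction logic we actually use.)

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

from my_project import get_account_list, pull_account  # hypothetical helpers

with DAG(
    dag_id="pull_accounts",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # One task per account, so Airflow handles retries / date ranges / parallelism.
    for account in get_account_list():  # runs at every DAG parse, which is the problem below
        PythonOperator(
            task_id=f"pull_{account}",
            python_callable=pull_account,
            op_kwargs={"account": account},
        )
```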
Problem
Since each account is a task, the account list needs to be accessed with every DAG parse. But since DAG files are parsed frequently, you don't necessarily want to query the database or wait for a REST call on every parse, from every machine, all day long. This could be resource-intensive and could cost money.
Question
Is there a good way to cache this type of config information in a local file, ideally with a specified time-to-live?
Thoughts
I have thought about a couple different approaches:
Write to a CSV or pickle file and use mtime to expire (a sketch follows this list).
The concern with this is that I might get collisions if two processes try to expire the file at the same time. I don't know how likely this is or what the consequences would be, but probably nothing terrible.
Create a common SQLite DB for all such processes. It should be auto-created the first time a variable is accessed. Each config variable gets a row in a table, and a last_modified_datetime column tells when to expire it.
This requires more elaborate code and dependencies.
Use Airflow Variables.
The nice thing about this is that it uses the existing DB, so there would be no cost per query and reasonable network lag, but it still requires a network round trip.
It has the benefit of being identical across all nodes in a multi-node setup.
Determining when to expire would probably be problematic, so I would probably create a config-manager DAG to update the config variables periodically.
But then this would add complexity to the deployment and development process: the variables need to be populated in order to define the DAGs properly, and all developers would need to manage this locally too, as opposed to a more create-on-read caching approach.
Subdags?
I have never used them, but I have a suspicion they could be used here. The community seems to discourage their use anyway, though...
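For reference, a rough sketch of the first option (a create-on-read file cache with an mtime-based TTL). `CACHE_PATH`, the TTL, and `fetch_accounts()` are placeholders, and the atomic rename is one way to sidestep the collision concern:

```python
import os
import pickle
import time

CACHE_PATH = "/tmp/account_list.pkl"
TTL_SECONDS = 15 * 60

def get_accounts_cached(fetch_accounts):
    try:
        if time.time() - os.path.getmtime(CACHE_PATH) < TTL_SECONDS:
            with open(CACHE_PATH, "rb") as f:
                return pickle.load(f)
    except (OSError, pickle.PickleError):
        pass  # missing/corrupt cache (e.g. two processes racing) -> just refetch

    accounts = fetch_accounts()          # the expensive DB query / REST call
    tmp = CACHE_PATH + f".{os.getpid()}"
    with open(tmp, "wb") as f:
        pickle.dump(accounts, f)
    os.replace(tmp, CACHE_PATH)          # atomic rename avoids partial reads
    return accounts
```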
Have you dealt with this problem? Did you arrive at a good solution? None of these seems very good.
Airflow default DAG parsing interval is pretty forgiving: 5 minutes. But even that is quite a lot for most people, so it's quite reasonable to increase that if your deployment isn't too close to the due times for the new DAGs.
In general, I'd say it's not that bad to make a REST request at every DAG parse heartbeat. Also, nowadays the scheduling process is decoupled from the parsing process, so that won't affect how fast your tasks are scheduled. Airflow caches the DAG definition for you.
If you think you still have reasons to put your own cache on top of that, my suggestion is to cache at the definitions server, not on the Airflow side. For example, using cache headers on the REST endpoint and handling cache invalidation yourself when you need it. But that could be some premature optimization, so my advice is to start without it and implement it only if you measure convincing evidence that you need it.
EDIT: regarding Webserver and Worker
It's true that the webserver will trigger DAG parses as well; I'm not sure how frequently, but probably following the gunicorn worker refresh interval (30 seconds by default). Workers will also do it by default at the start of every task, but that can be avoided if you activate DAG pickling. I'm not sure that's a good idea, though; I've heard it is destined to be deprecated.
One other thing you can try is to cache that in the Airflow process itself, memoizing the function that makes the expensive request. Python has functools.lru_cache built in for that, and together with pickling it might be enough and very much easier than the other options.
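As a sketch of that idea: lru_cache itself has no TTL, so a coarse time-bucket argument is one common workaround, and `fetch_accounts()` below is a placeholder for the expensive call:

```python
import functools
import time

def fetch_accounts():
    """Placeholder for the expensive DB query / REST call."""
    ...

@functools.lru_cache(maxsize=1)
def _accounts_for_bucket(bucket: int):
    return fetch_accounts()

def get_accounts(ttl_seconds: int = 300):
    # All calls within the same 5-minute bucket share one cached result.
    return _accounts_for_bucket(int(time.time() // ttl_seconds))
```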
I have the same exact scenario.
I have API calls for multiple accounts, and initially created a Python script to iterate through the list.
When I started using Airflow, I thought about doing what you are planning to do. I tried two of the alternatives you listed, and after some experimentation decided to handle retry logic within Python, with simple try/except blocks when HTTP calls fail (a rough sketch follows the list below). My reasons are:
One script to maintain
Fewer Airflow objects
Restartability is easier with one script in place.
(restarting failed job in Airflow is not a breeze (no pun intended))
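A rough sketch of what I mean by simple try/except retries; `pull_account()` is a placeholder for your own request logic:

```python
import time
import requests

MAX_RETRIES = 3

def pull_all(accounts):
    for account in accounts:
        for attempt in range(1, MAX_RETRIES + 1):
            try:
                pull_account(account)      # the third-party API call
                break
            except requests.RequestException:
                if attempt == MAX_RETRIES:
                    raise                  # give up; surface the failure
                time.sleep(2 ** attempt)   # simple exponential backoff
```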
In the end it's up to you; that was just my experience.
I have a use case where I have an EC2 instance running Fedora Linux and some applications. When this instance fails, I have to spin up a new instance with the same OS and install the applications. I am trying to do this in Ansible (and Python); I'm a complete novice and have no idea how to do it.
For my applications, I have a variable (a structure) that tells me how many of each type of server I need in each of three subnets. The environment creation playbook loops through that structure and builds however many are needed to fill the requirements. So if I need five (5) and only have three (3), it builds two (2). I use exact_count for that in the ec2 module.
So if one fails, I can delete that instance and re-run my create playbook, which will also rewrite all the configuration files the other servers use to communicate with each other. For instance, if I delete a JBoss server and create a new one to replace it, the load balancer has to know about it.
Good practice here would be to have a base image that covers what you need, use that to build an AMI, and then plug it into an auto-scaling group. As part of the auto-scaling group, you can use user data to apply specific updates, etc. to the instance at boot time.
An auto-scaling group with min 1 and max 1 will do exactly what you want, if you can configure it the way described above.
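Purely as an illustration in boto3 rather than Ansible (every name and ID below is a placeholder), the AMI + user data + min 1/max 1 group could be wired up like this:

```python
import base64
import boto3

ec2 = boto3.client("ec2")
autoscaling = boto3.client("autoscaling")

# Boot-time customisation goes into user data on the launch template.
user_data = """#!/bin/bash
yum -y update
# ... install/refresh the applications here ...
"""

ec2.create_launch_template(
    LaunchTemplateName="fedora-app-template",      # hypothetical name
    LaunchTemplateData={
        "ImageId": "ami-0123456789abcdef0",        # the AMI baked from the base image
        "InstanceType": "t3.small",
        "UserData": base64.b64encode(user_data.encode()).decode(),
    },
)

autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="fedora-app-asg",
    LaunchTemplate={"LaunchTemplateName": "fedora-app-template", "Version": "$Latest"},
    MinSize=1, MaxSize=1, DesiredCapacity=1,
    VPCZoneIdentifier="subnet-0123456789abcdef0",  # the subnet(s) to launch into
)
```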
I have a Python program that I am running as a Job on a Kubernetes cluster every 2 hours. I also have a webserver that starts the job whenever user clicks a button on a page.
I need to ensure that at most only one instance of the Job is running on the cluster at any given time.
Given that I am using Kubernetes to run the job and connecting to PostgreSQL from within the job, the solution should somehow leverage these two. I thought a bit about it and came up with the following ideas:
Find a setting in Kubernetes that would set this limit; attempts to start a second instance would then fail. I was unable to find this setting.
Create a shared lock or mutex. The disadvantage is that if the job crashes, it may not unlock before quitting.
Kubernetes runs etcd; maybe I can use that.
Create a 'lock' table in PostgreSQL; when a new instance connects, it checks whether it is the only one running. Use transactions somehow so that one wins and proceeds while the others quit. I have not yet thought this out, but it should work.
Query the Kubernetes API for a label I use on the job and see whether any instances exist. This may not be atomic, so more than one instance may slip through.
What are the usual solutions to this problem given the platform choice I made? What should I do, so that I don't reinvent the wheel and have something reliable?
A completely different approach would be to run a (web) server that executes the job functionality. At a high level, the idea is that the webserver can contact this new job server to execute functionality. In addition, this new job server will have an internal cron to trigger the same functionality every 2 hours.
There could be 2 approaches to implementing this:
You can put the checking mechanism inside the jobserver code to ensure that even if 2 API calls happen simultaneously to the job server, only one executes, while the other waits. You could use the language platform's locking features to achieve this, or use a message queue.
You can put the checking mechanism outside the jobserver code (in the database) to ensure that only one API call executes. Similar to what you suggested. If you use a postgres transaction, you don't have to worry about your job crashing and the value of the lock remaining set.
The pros/cons of both approaches are straightforward. The major difference in my mind between 1 & 2, is that if you update the job server code, then you might have a situation where 2 job servers might be running at the same time. This would destroy the isolation property you want. Hence, database might work better, or be more idiomatic in the k8s sense (all servers are stateless so all the k8s goodies work; put any shared state in a database that can handle concurrency).
Addressing your ideas, here are my thoughts:
Find a setting in k8s that will limit this: k8s will not start things with the same name (in the metadata of the spec). But anything else goes for a job, and k8s will start another job.
a) etcd3 supports distributed locking primitives. However, I've never used this and I don't really know what to watch out for.
b) A Postgres lock should work. Even if the job crashes, you don't have to worry about the lock remaining set, since a session-level lock (or an open transaction) is released when the connection drops (see the sketch after this list).
Querying the k8s API server for things that should be atomic is not a good idea, as you said. I've used a system that reacts to k8s events (like an annotation change on an object spec), but I've had bugs where my 'operator' suddenly stops getting k8s events and needs to be restarted; or again, if I want to push an update to the event-handler server, there might be 2 event handlers running at the same time.
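As a sketch of idea b), a session-level advisory lock in PostgreSQL does exactly this; the connection parameters, lock key, and `run_job()` below are placeholders, and the lock is released automatically if the connection drops:

```python
import sys
import psycopg2

JOB_LOCK_ID = 42  # arbitrary application-chosen lock key

conn = psycopg2.connect("dbname=mydb user=myuser host=postgres")
with conn.cursor() as cur:
    cur.execute("SELECT pg_try_advisory_lock(%s)", (JOB_LOCK_ID,))
    if not cur.fetchone()[0]:
        sys.exit("another instance of the job is already running")

run_job()      # placeholder for the actual work; keep `conn` open while it runs
conn.close()   # releases the advisory lock
```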
I would recommend sticking with what you are most familiar with. In my case, that would be implementing a job server as a k8s Deployment that runs as a server and listens to events/API calls.
I'm trying to create a background service in Python. The service will be called from another Python program. It needs to run as a daemon process because it uses a heavy object (300 MB) that has to be loaded into memory beforehand. I've had a look at python-daemon but still haven't figured out how to do it. In particular, I know how to make a daemon run and periodically do some work itself, but I don't know how to make it callable from another program. Could you please give me some help?
I had a similar situation when I wanted to access a big binary matrix from a web app.
Of course there are many solutions, but I used Redis, a popular in-memory database/cache system, to store and access my object successfully. It has practical Python bindings (several probably equivalent wrapper libraries).
The main advantage is that when the service goes down, a copy of the data still remains on disk. Also, I noticed that once in place, it could be used for other things in my app (for instance Celery proposes it as backend), and actually, for other services in any other unrelated program.
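A minimal sketch of that pattern with the redis-py client (the key name, connection details, and `heavy_object` are placeholders):

```python
import pickle
import redis

r = redis.Redis(host="localhost", port=6379)   # assumes a local Redis server

# In the long-running service, once the 300 MB object has been built:
r.set("heavy_object", pickle.dumps(heavy_object))

# In any other program that needs it later:
heavy_object = pickle.loads(r.get("heavy_object"))
```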
I am looking into starting a project that involves executing Python code that the user enters via an HTML form. I know this can be potentially lethal (exec), but I have seen it done successfully in at least one instance.
I sent an email off to the developers of the Python Challenge and I was told they are using a solution they came up with themselves, and they only let on that they are using "security features provided by the operating system" and that "the operating system [Linux] provides most of the security you need if you know how to use it."
Would anyone know of a safe and secure way to go about doing this? I thought about spawning a new VM for every submission, but that would have way too much overhead and be nearly impossible to implement efficiently.
On a modern Linux, in addition to chroot(2), you can restrict the process further by using clone(2) instead of fork(2). There are several interesting clone(2) flags:
CLONE_NEWIPC (new namespace for semaphores, shared memory, message queues)
CLONE_NEWNET (new network namespace - nice one)
CLONE_NEWNS (new set of mountpoints)
CLONE_NEWPID (new set of process identifiers)
CLONE_NEWUTS (new hostname, domainname, etc)
Previously this functionality was implemented in OpenVZ and was then merged upstream, so there is no need for a patched kernel anymore.
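As a hedged sketch, the same namespace flags can be passed to unshare(2) from Python via ctypes. This assumes Linux and sufficient privileges (typically CAP_SYS_ADMIN), and note that a new PID namespace only applies to children created afterwards:

```python
import ctypes
import ctypes.util
import os
import subprocess

# Namespace flags from <linux/sched.h>
CLONE_NEWNS  = 0x00020000
CLONE_NEWUTS = 0x04000000
CLONE_NEWIPC = 0x08000000
CLONE_NEWPID = 0x20000000
CLONE_NEWNET = 0x40000000

libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)

def isolate():
    """Detach from the parent's namespaces before running untrusted code."""
    flags = CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC | CLONE_NEWPID | CLONE_NEWNET
    if libc.unshare(flags) != 0:
        raise OSError(ctypes.get_errno(), os.strerror(ctypes.get_errno()))

if __name__ == "__main__":
    isolate()
    # With CLONE_NEWPID, only children started after this point get the new PID namespace.
    subprocess.run(["hostname"])  # placeholder for the sandboxed interpreter
```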
http://codepad.org/about has implemented such a system successfully (as a public code pasting/running service!)
codepad.org is an online compiler/interpreter, and a simple collaboration tool. It's a pastebin that executes code for you. [...]
How it works
Code execution is handled by a supervisor based on geordi. The strategy is to run everything under ptrace, with many system calls disallowed or ignored. Compilers and final executables are both executed in a chroot jail, with strict resource limits. The supervisor is written in Haskell.
[...]
When your app is remote code execution, you have to expect security problems. Rather than rely on just the chroot and ptrace supervisor, I've taken some additional precautions:
The supervisor processes run on virtual machines, which are firewalled such that they are incapable of making outgoing connections.
The machines that run the virtual machines are also heavily firewalled, and restored from their source images periodically.
If you run the script as user nobody (on Linux), it can write practically nowhere and read no data that has its permissions set up properly. But it could still cause a DoS attack by, for example:
filling up /tmp
eating all RAM
eating all CPU
Furthermore, outgoing network connections can be opened, et cetera. You can probably lock all of these down with kernel limits, but you are bound to forget something.
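For example, here is a sketch of per-process kernel limits from Python (the script name and the numbers are arbitrary placeholders); this caps CPU, memory, and file size but is not, by itself, a sandbox:

```python
import resource
import subprocess

def limit_resources():
    # Runs in the child just before exec: cap CPU time, memory, and file size
    # so a hostile script cannot starve the host or fill the disk.
    resource.setrlimit(resource.RLIMIT_CPU, (5, 5))                  # 5 CPU-seconds
    resource.setrlimit(resource.RLIMIT_AS, (256 * 1024 ** 2,) * 2)   # 256 MB address space
    resource.setrlimit(resource.RLIMIT_FSIZE, (1024 ** 2,) * 2)      # 1 MB max file size

subprocess.run(
    ["python3", "untrusted.py"],      # placeholder for the submitted script
    preexec_fn=limit_resources,
    timeout=10,                       # wall-clock cap on top of the CPU limit
)
```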
So I think that a virtual machine with no access to the network or the real hard drive would be the only (reasonably) safe route. Perhaps the developers of the Python Challenge use KVM which is, in principle, "provided by the operating system".
For efficiency, you could run all submissions in the same VM. That saves you a lot of overhead, and in the worst case they only hamper each other, not your server.
Using chroot (Wikipedia) may be part of the solution, e.g. combined with ulimit and some other common (or custom) tools.