How to properly administer a database using a script - Python

I use a Python script to insert data into my database (using pandas and SQLAlchemy). The script reads from various sources, cleans the data, and inserts it into the database. I plan on running this script once in a while to completely overwrite the existing database with more recent data.
At first I wanted to keep a single service and simply add an endpoint requiring higher privileges to run the script. But in the end that looks a bit odd and, more importantly, the Python script uses quite a lot of memory (~700 MB), which makes me wonder how I should configure my deployment.
Increasing the memory limit of my pod for this once-in-a-while operation looks like a bad idea to me, but I'm quite new to Kubernetes, so maybe I'm wrong. Hence this question.
So what would be a good (better) solution? Running another service just for that, or simply connecting to my machine and running the update manually with the Python script directly?
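
For reference, the script boils down to something like the following sketch (the source file, connection URL, table name, and clean() are placeholders):

    import pandas as pd
    from sqlalchemy import create_engine

    engine = create_engine("postgresql://user:secret@db-host/mydb")  # placeholder URL

    df = pd.read_csv("source.csv")   # one of the "various sources"
    df = clean(df)                   # hypothetical cleaning step

    # Replace the existing table wholesale with the fresh data
    df.to_sql("measurements", engine, if_exists="replace", index=False)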

To run on demand:
https://kubernetes.io/docs/concepts/workloads/controllers/job/
This creates a Pod that runs to completion (exit) exactly once - a Job.
To run on a schedule:
https://kubernetes.io/docs/concepts/workloads/controllers/cron-jobs/
Every time the schedule fires, a new, separate Job is created.
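If you want to kick off the on-demand Job from Python instead of kubectl, a rough sketch with the official kubernetes client library could look like this (the image name, namespace, and 1Gi limit are placeholders for your own values):

    from kubernetes import client, config

    config.load_kube_config()  # use config.load_incluster_config() when running inside the cluster

    job = client.V1Job(
        api_version="batch/v1",
        kind="Job",
        metadata=client.V1ObjectMeta(name="db-refresh"),
        spec=client.V1JobSpec(
            backoff_limit=1,
            template=client.V1PodTemplateSpec(
                spec=client.V1PodSpec(
                    restart_policy="Never",
                    containers=[
                        client.V1Container(
                            name="db-refresh",
                            image="registry.example.com/db-refresh:latest",  # placeholder image
                            resources=client.V1ResourceRequirements(
                                limits={"memory": "1Gi"}  # sized for the ~700 MB script
                            ),
                        )
                    ],
                )
            ),
        ),
    )

    client.BatchV1Api().create_namespaced_job(namespace="default", body=job)

Either way, the memory limit lives on the Job's own Pod, so the occasional 700 MB spike never has to be reflected in your service's deployment.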

Related

Profiling for a program using python-eve server

I am running a backend python-eve server with multiple functions being called to provide one service. I want to profile this Python backend server to find out which of those functions takes the most time to execute. I have heard of and used cProfile, but how do I profile a server that is continuously running? Moreover, I am using the PyCharm IDE to work with the Python code, so it would be beneficial if there's a way to do the profiling from PyCharm.
While I do not have direct experience with python-eve, I wrote pprofile (pypi) mainly to use on Zope (also a long-running process).
The basic idea (at least for processes using worker threads, like Zope) is to start pprofile in statistic mode, let it collect samples for a while (how long depends heavily on how busy the process is and on the level of detail you want in the profiling result), and finally build a profiling result archive, which in my case contains both the profiling result and all executed source code (so I am sure I am annotating the correct version of the code).
I have so far not needed extra-long profiling sessions, e.g. one query to start profiling and a later query to stop it and fetch the result - or even keeping the result server-side and browsing it somehow; how to do that will likely depend heavily on server details.
You can find the extra customisation for Zope in pprofile.zope (it allows fetching the Python source out of Zope's Python Scripts and TAL expressions, and beyond pure Python profiling it also collects object database loading durations and even the SQL queries run, to provide a broader picture), just to give you an idea of what can be done.
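Roughly, the statistic-mode entry point looks like the sketch below (double-check the exact signatures against the pprofile documentation; handle_requests() is just a stand-in for whatever keeps your eve server busy):

    import pprofile

    prof = pprofile.StatisticalProfile()
    with prof(period=0.001):     # take a sample roughly every millisecond
        handle_requests()        # hypothetical: the work you want to observe

    # Line-by-line annotated output; see the pprofile docs for dumping to a file or archive instead
    prof.print_stats()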

python mysql, connect to latest DB of many?

I am currently running into what feels like it should have a simple solution, but perhaps I could get some help.
I am writing a Python script that connects to a database server to retrieve the latest information, but I am running into one key issue.
There is no single 'database'. The tool that creates the data I need creates a new database every time it generates information, so when I connect to the server there are literally 30+ databases, and new ones keep appearing every week when the program runs its data collection.
So for example, I have databases called collection_2016_9_15, collection_2016_9_9, and so on. This becomes a problem because when I want to run a query I need to tell Python which database to connect to, and this is supposed to be automated.
Right now, since it runs weekly, I know I can run the script on the day the data is collected and just connect to database collection_Y%_M%_D%, but that only works if I run it on the same day the program runs, so any delay or issue breaks the automation.
So is there any way to tell python to connect to the 'most recent database' without trying to give it a specific name?
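One way to make the 'most recent database' idea concrete is to list the databases and pick the newest by the date embedded in the name; a minimal sketch assuming pymysql and placeholder credentials:

    import re
    from datetime import date

    import pymysql

    # Placeholder credentials - adjust for your server
    conn = pymysql.connect(host="localhost", user="user", password="secret")

    with conn.cursor() as cur:
        cur.execute("SHOW DATABASES")
        names = [row[0] for row in cur.fetchall()]

    # Parse 'collection_2016_9_15' style names into dates and keep the newest
    pattern = re.compile(r"^collection_(\d{4})_(\d{1,2})_(\d{1,2})$")
    dated = []
    for name in names:
        m = pattern.match(name)
        if m:
            dated.append((date(*map(int, m.groups())), name))

    latest = max(dated)[1]
    conn.select_db(latest)   # switch to the most recent collection_* database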

Ansible runner runs too long

When I use Ansible's Python API to run a script on remote machines (thousands of them), the code is:

    import ansible.runner

    runner = ansible.runner.Runner(
        module_name='script',
        module_args='/tmp/get_config.py',
        pattern='*',
        forks=30,
    )

Then I call:

    datastructure = runner.run()

This takes too long. I want to insert the stdout from the results into MySQL, and what I really want is that as soon as one machine returns its data, that data is inserted into MySQL, then the next, until all the machines have returned.
Is this a good idea, or is there a better way?
The runner call will not complete until every machine has returned data, could not be contacted, or its SSH session timed out. Given that this targets thousands of machines and you're only running 30 in parallel (forks=30), it's going to take roughly Time_to_run_script * Num_Machines / 30 to complete. Does this align with your expectation?
You could raise the number of forks to a much higher value so the runner completes sooner. I've pushed this into the hundreds without much issue.
If you want maximum visibility into what's going on, and aren't sure whether a single machine is holding you up, you could run through each host serially in your Python code, roughly as sketched below.
FYI - this module and class are completely gone in Ansible 2.0, so you might want to make the jump now to avoid having to rewrite the code later.
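A rough sketch of that serial approach with the Ansible 1.x API, assuming the script module's per-host result exposes stdout and that the MySQL table (config_dump here, with made-up columns) already exists:

    import ansible.runner
    import pymysql

    hosts = ['web01', 'web02']   # hypothetical host list; in practice read it from your inventory

    db = pymysql.connect(host='localhost', user='user', password='secret', database='reports')

    for host in hosts:
        result = ansible.runner.Runner(
            module_name='script',
            module_args='/tmp/get_config.py',
            pattern=host,    # target one host at a time
            forks=1,
        ).run()

        contacted = result.get('contacted', {})
        if host in contacted:
            stdout = contacted[host].get('stdout', '')
            with db.cursor() as cur:
                cur.execute(
                    "INSERT INTO config_dump (host, output) VALUES (%s, %s)",
                    (host, stdout),
                )
            db.commit()
        else:
            # host was unreachable; details are under result.get('dark', {})
            pass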

postgres database: When does a job get killed

I am using a Postgres database with SQLAlchemy and Flask. I have a couple of jobs that have to run through the entire database to update entries. When I do this on my local machine I get very different behavior compared to the server.
For example, there seems to be an upper limit on how many entries I can fetch from the database at once.
On my local machine I just query all elements, while on the server I have to query 2000 entries at a time.
If I request too many entries, the server gives me the message 'Killed'.
I would like to know:
1. Who is killing my jobs (SQLAlchemy, Postgres)?
2. Since this behaves differently on my local machine, there must be a way to control it. Where would that be?
thanks
carl
Just the message "killed" appearing in the terminal window usually means the kernel was running out of memory and killed the process as an emergency measure.
Most libraries that connect to PostgreSQL read the entire result set into memory by default, but some have a way to tell them to process the results row by row, so the rows aren't all read into memory at once. I don't know if Flask (or the SQLAlchemy layer it sits on) exposes this option or not.
Perhaps your local machine has more available RAM than the server does (or fewer demands on the RAM it does have), or perhaps your local machine is configured to read from the database row by row rather than all at once.
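For what it's worth, SQLAlchemy (which the asker mentions) can request a server-side cursor so rows come back in chunks rather than all at once; a minimal sketch, with a placeholder connection URL, table name, and process() function:

    from sqlalchemy import create_engine, text

    engine = create_engine("postgresql://user:secret@localhost/mydb")  # placeholder URL

    with engine.connect() as conn:
        # stream_results -> server-side cursor: rows are fetched in chunks, not loaded all at once
        result = conn.execution_options(stream_results=True).execute(
            text("SELECT id, payload FROM entries")   # 'entries' is a made-up table
        )
        for row in result:
            process(row)   # hypothetical per-row update logic

When going through the ORM instead of raw SQL, Query.yield_per() serves a similar purpose.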
Most likely the kernel is killing your Python script. Python can have horrible memory usage.
I have a feeling you are trying to do these 2000-entry batches in a loop in a single Python process. Python does not release all the memory it has used, so memory usage grows until the process gets killed. (You can watch this happen with the top command.)
You should try adapting your script to process 2000 records in one step and then quit; if you run it multiple times, it should continue where it left off. Or, a better option, try multiprocessing and run each batch in a separate worker. Run the jobs serially and let each worker die when it finishes; that way the memory is released back to the OS on exit. A rough sketch of that idea follows below.
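A minimal sketch of the worker-per-batch approach, assuming the job can be written as a function of a row offset; process_batch() and the batch size are placeholders:

    from multiprocessing import Process

    BATCH = 2000

    def process_batch(offset):
        # Hypothetical batch job: open its own DB connection, update rows
        # [offset, offset + BATCH), then return. All memory is freed when
        # this worker process exits.
        ...

    def run_all(total_rows):
        for offset in range(0, total_rows, BATCH):
            p = Process(target=process_batch, args=(offset,))
            p.start()
            p.join()   # run batches serially; each worker dies and releases its memory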

Uploading code to server and run automatically

I'm fairly competent with Python, but I've never 'uploaded code' to a server before and had it run automatically.
I'm working on a project that would require some code to be running 24/7. At certain points of the day, if a criterion is met, a process is started. For example: a database may contain records of what time each user wants to receive a daily newsletter (for some subjective reason), and the code would, at the right time of day, send the newsletter to the correct person. And of course, all of this would be running on a cloud server.
Any help would be appreciated - even correcting my entire formulation of the problem! If you know how to do this in another language, please reply with your solution!
Thanks!
Here are two approaches to this problem, both of which require shell access to the cloud server (a sketch of the first approach follows the list).
1. Write the program to handle the scheduling itself: sleep for a short interval, wake up to perform the necessary checks, and repeat. You would then transfer the file to the server with a tool like scp, log in, and start it in the background with something like python myscript.py &.
2. Write the program to do a single run only, and use the scheduling tool cron to start it every minute of the day.
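A minimal sketch of the first option, where due_newsletters() and send_newsletter() are hypothetical stand-ins for your database query and mailing code:

    import time
    from datetime import datetime

    def due_newsletters(now):
        # Hypothetical query: return the users whose send time matches 'now'
        ...

    def send_newsletter(user):
        # Hypothetical: build and send the email for one user
        ...

    while True:
        now = datetime.now()
        for user in due_newsletters(now):
            send_newsletter(user)
        time.sleep(60)   # check once a minute; adjust the interval to taste

For the second option, a crontab entry such as * * * * * /usr/bin/python /home/me/myscript.py (paths adjusted to your setup) runs a single-shot version of the script every minute.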
Took a few days, but I finally worked this out. The most practical way to get this working is to use a VPS that runs the script. The confusing part of my setup was that each user would activate the script at a different time. To handle this, the VPS runs the Python script at a fixed time, say midnight (using scheduled tasks or something similar); the script then pulls the per-user times from a database and processes them at the times outlined.
Thanks for your time anyways!
