I use a Python script to insert data into my database (using pandas and SQLAlchemy). The script reads from various sources, cleans the data, and inserts it into the database. I plan to run this script once in a while to completely overwrite the existing database with more recent data.
At first I wanted to have a single service and simply add an endpoint that would require higher privileges to run the script. But in the end that looks a bit odd and, more importantly, that Python script uses quite a lot of memory (~700 MB), which makes me wonder how I should configure my deployment.
Increasing the memory limit of my pod for this (once in a while) operation looks like a bad idea to me, but I'm quite new to Kubernetes, so maybe I'm wrong. Thus this question.
So what would be a good (better) solution? Run another service just for that, or simply connect to my machine and run the update manually using the Python script directly?
To run on demand
https://kubernetes.io/docs/concepts/workloads/controllers/job/
A Job creates a Pod that runs once, until completion (exit).
To run on a schedule
https://kubernetes.io/docs/concepts/workloads/controllers/cron-jobs/
Each time the schedule fires, a CronJob creates a new, separate Job.
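As a rough sketch of how this could look for the database-refresh script in the question: the snippet below builds a Job and a CronJob manifest as plain Python dictionaries and prints the Job as JSON, which kubectl accepts as well as YAML. The image name, script name, resource names, memory size, and schedule are all assumptions for illustration. The point is that the Job's Pod carries its own memory limit, so the limit of the main service does not have to grow.

```python
import json

# Hypothetical values: adjust the image, names, and sizes to your own setup.
IMAGE = "registry.example.com/db-refresh:latest"


def pod_template(memory="1Gi"):
    """Pod template with its own memory request/limit, separate from the web service."""
    return {
        "spec": {
            "restartPolicy": "Never",  # Jobs require Never or OnFailure
            "containers": [
                {
                    "name": "db-refresh",
                    "image": IMAGE,
                    "command": ["python", "refresh.py"],  # hypothetical script name
                    "resources": {
                        "requests": {"memory": memory},
                        "limits": {"memory": memory},
                    },
                }
            ],
        }
    }


def job_manifest(name="db-refresh"):
    """One-off Job: runs the Pod to completion once."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {"backoffLimit": 0, "template": pod_template()},
    }


def cronjob_manifest(name="db-refresh-nightly", schedule="0 3 * * *"):
    """CronJob: creates a new Job from jobTemplate each time the schedule fires."""
    return {
        "apiVersion": "batch/v1",  # older clusters may need batch/v1beta1
        "kind": "CronJob",
        "metadata": {"name": name},
        "spec": {
            "schedule": schedule,
            "jobTemplate": {"spec": job_manifest()["spec"]},
        },
    }


if __name__ == "__main__":
    print(json.dumps(job_manifest(), indent=2))
```

Assuming the file is saved as make_job.py (a made-up name), the output can be piped straight into the cluster with python make_job.py | kubectl apply -f -.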
As part of my homework I need to load large data files, parsed using Python, into two MySQL tables on my guest machine, which I access via vagrant ssh.
I then need to run a Sqoop job on one of the two tables. So far I have loaded one of the tables successfully and started the Python script to load the second table; it has been running for more than 3 hours and is still loading.
I was wondering whether I could run my Sqoop job on the already loaded table instead of staring at a black screen for almost 4 hours.
My questions are:
Is there any other way to vagrant ssh into the same machine without doing a vagrant reload (because --reload eventually shuts down my virtual machine, thereby killing all the jobs currently running on my guest)?
If there is, and I open a parallel window to log in to the guest machine as usual and start working on my Sqoop job on the first table, which is already loaded, will it affect my current job that is still loading the second table in any way? Or will it cause data loss? I can't risk redoing the load, as the data is super large and loading it is extremely time-consuming.
The Python code goes like this:

    def parser():
        # Read the source file line by line
        with open('1950-sample.txt', 'r', encoding='latin_1') as infile:
            for line in infile:
                ....

    # Inserting into tables
    def insert():
        if tablename == '1950_psr':
            cursor.execute("INSERT INTO 1950_psr (usaf,wban,obs_da_dt,lati,longi,elev,win_dir,qc_wind_dir,sky,qc_sky,visib,qc_visib,air_temp,qc_air_temp,dew_temp,qc_dew_temp,atm_press,qc_atm_press) VALUES (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)", (USAF,WBAN,obs_da_dt,lati,longi,elev,win_dir,qc_wind_dir,sky,qc_sky,visib,qc_visib,air_temp,qc_air_temp,dew_temp,qc_dew_temp,atm_press,qc_atm_press))
        elif tablename == '1986_psr':
            cursor.execute("INSERT INTO 1986_psr (usaf,wban,obs_da_dt,lati,longi,elev,win_dir,qc_wind_dir,sky,qc_sky,visib,qc_visib,air_temp,qc_air_temp,dew_temp,qc_dew_temp,atm_press,qc_atm_press) VALUES (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)", (USAF,WBAN,obs_da_dt,lati,longi,elev,win_dir,qc_wind_dir,sky,qc_sky,visib,qc_visib,air_temp,qc_air_temp,dew_temp,qc_dew_temp,atm_press,qc_atm_press))

    parser()

    # Saving & closing
    conn.commit()
    conn.close()
I don't know what's in your login scripts, and I'm not clear what that --reload flag is, but in general you can have multiple ssh sessions to the same machine. Just open another terminal and ssh into the VM.
However, in your case, that's probably not a good idea. I suspect that the second table is taking a long time to load because your database is reindexing or waiting on a lock to be released.
Unless you are loading hundreds of megabytes, I suggest you first check for locks and see what queries are pending.
Even if you are loading a very large dataset and there are no constraints on the table you need for your script, you are just going to pile more work onto a machine that's already taxed pretty heavily...
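One way to see what is pending is MySQL's SHOW FULL PROCESSLIST statement. The helper below is only a sketch: it assumes a DB-API cursor that returns plain tuples (the default in the common MySQL drivers), and the function name and the 5-second threshold are made up for illustration.

```python
def pending_queries(cursor, min_seconds=5):
    """Return (id, seconds, state, sql) for active statements older than min_seconds."""
    cursor.execute("SHOW FULL PROCESSLIST")
    pending = []
    for row in cursor.fetchall():
        # Column order: Id, User, Host, db, Command, Time, State, Info
        conn_id, _user, _host, _db, command, seconds, state, info = row[:8]
        if command == "Sleep" or info is None:
            continue  # idle connection, nothing is running
        if (seconds or 0) >= min_seconds:
            pending.append((conn_id, seconds, state, info))
    return pending
```

Run from a second SSH session with a cursor on the same server, a State value that mentions waiting for a lock on the slow INSERT would point at locking rather than at the size of the data.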
I'm looking for a way to update lots of rows in a DB (100K+). I'm connecting to the DB remotely and updating the rows, but it takes way too much time. If I download the DB to my local machine and update the rows there, it's fast. I'm looking for a way to process the DB on my computer and then send only the changed part, all together. For example, I can write a script that downloads the DB, edits it on my local machine, and sends it back, and that takes way less time than doing the update remotely. What I'm looking for is a way to send back only the column where the edit takes place and insert it on the server machine.
Update: It looks like it's not the update script that is causing the bottleneck. I still need to find out what is causing it.
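For reference, the "send only the changed part, all together" idea from the question could be sketched as below. This is not a diagnosis of the bottleneck. It uses an in-memory SQLite database as a stand-in for the remote server, and the table items and column price are hypothetical. With a remote MySQL or PostgreSQL server the same DB-API pattern should apply, though the placeholder style is usually %s instead of ?.

```python
import sqlite3


def changed_rows(old, new):
    """Return (new_value, id) pairs for rows whose value actually changed."""
    return [(new[k], k) for k in new if old.get(k) != new[k]]


def apply_changes(conn, changes):
    """Send all changes in a single transaction rather than one commit per row."""
    with conn:  # commits on success, rolls back on error
        conn.executemany("UPDATE items SET price = ? WHERE id = ?", changes)
    return len(changes)


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")  # stand-in for the remote database
    conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, price REAL)")
    conn.executemany("INSERT INTO items VALUES (?, ?)", [(1, 1.0), (2, 2.0), (3, 3.0)])
    conn.commit()

    old = dict(conn.execute("SELECT id, price FROM items"))
    new = {1: 1.0, 2: 2.5, 3: 3.5}  # values after local processing
    print(apply_changes(conn, changed_rows(old, new)))
```

Only rows 2 and 3 differ from the stored values here, so only those two updates are sent.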
I'm fairly competent with Python, but I've never 'uploaded code' to a server before and had it run automatically.
I'm working on a project that would require some code to be running 24/7. At certain points of the day, if a criterion is met, a process is started. For example: a database may contain records of what time each user wants to receive a daily newsletter (for some subjective reason) - the code would, at the right time of day, send the newsletter to the correct person. But of course, all of this is running on a cloud server.
Any help would be appreciated - even correcting my entire formulation of the problem! If you know how to do this in any other language - please reply with your solutions!
Thanks!
Here are two approaches to this problem, both of which require shell access to the cloud server.
Write the program to handle the scheduling itself. For example, sleep and wake up every few milliseconds to perform the necessary checks. You would then transfer this file to the server using a tool like scp, log in, and start it in the background using something like python myscript.py &.
Write the program to do a single run only, and use the scheduling tool cron to start it up every minute of the day.
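Both approaches can share the same check, as in the sketch below. The schedule dictionary, the user names, and send_newsletter are placeholders, and the loop wakes once per minute rather than every few milliseconds, which is an assumption about how precise the send times need to be.

```python
import time
from datetime import datetime


def due_users(schedule, now):
    """Return users whose send time (HH:MM) matches the current minute."""
    current = now.strftime("%H:%M")
    return sorted(user for user, send_at in schedule.items() if send_at == current)


def send_newsletter(user):
    # Placeholder: a real script would send an email here.
    print(f"sending newsletter to {user}")


def run_once(schedule, now=None):
    """Single run, suitable for cron: handle everyone due this minute, then exit."""
    now = now or datetime.now()
    users = due_users(schedule, now)
    for user in users:
        send_newsletter(user)
    return users


def run_forever(schedule):
    """Self-scheduling variant: do the same check at the start of every minute."""
    while True:
        run_once(schedule)
        # Sleep until the next minute boundary; a run longer than 60 s skips minutes.
        time.sleep(60 - time.time() % 60)


if __name__ == "__main__":
    # Hypothetical data; in the real project this would come from the database.
    run_once({"alice": "08:00", "bob": "21:30"})
```

For the cron variant, a crontab entry such as * * * * * /usr/bin/python3 /home/me/myscript.py (paths are examples) starts the single-run version every minute.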
It took a few days, but I finally found a way to work this out. The most practical way to get this working is to use a VPS that runs the script. The confusing part of my code was that each user would activate the script at a different time for themselves. To handle this, the VPS runs the Python script at, say, midnight (using a scheduled task or something similar). The script then pulls the times from a database and processes the code at those times.
Thanks for your time anyways!
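A minimal sketch of that midnight approach, using the standard library's sched module: the script is started once a day, reads the send times, and queues one call per user. The send_times dictionary stands in for the database query, and print stands in for the real action; both are assumptions.

```python
import sched
import time
from datetime import datetime


def plan_day(send_times, today, action, scheduler):
    """Queue one call of action(user) per user at today's HH:MM send time."""
    for user, hhmm in send_times.items():
        hour, minute = map(int, hhmm.split(":"))
        when = today.replace(hour=hour, minute=minute, second=0, microsecond=0)
        scheduler.enterabs(when.timestamp(), 1, action, argument=(user,))
    return scheduler


if __name__ == "__main__":
    # Hypothetical data; the real script would SELECT these from the database.
    send_times = {"alice": "08:00", "bob": "21:30"}
    scheduler = sched.scheduler(time.time, time.sleep)
    plan_day(send_times, datetime.now(), print, scheduler)
    for event in scheduler.queue:
        print(datetime.fromtimestamp(event.time), event.argument)
    # Calling scheduler.run() here would block until the last entry has fired.
    # Entries whose time has already passed fire immediately when run() starts.
```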
I am a self-taught Python programmer and I have an idea for a project that I'd like to use to better my understanding of socket programming and networking in general. I was hoping that someone would be able to tell me if I am on the right path, or point me in another direction.
The general idea is to be able to update a database via a website UI; a Python server would then constantly check that database for changes. If it notices changes, it would hand out a set of instructions to the first available connected client.
The objective is to
A - Create a server that reads from a database.
B - Create clients that connect to said server from a remote machine.
C - The server would then constantly read the database and look for changes, for instance a Boolean column that signifies Run/Don't run.
D - If, say, the run Boolean is true, the server would then hand the instructions off to the first available client.
E - The client itself would then handle updating the database with certain runtime occurrences.
Questions/Concerns
A - My first concern is the resources it would take to be constantly reading the database and searching for changes. Is there a better way of doing this? Or could I write a loop to do this and not worry much about it?
B - I have been reading the documentation/tutorials on Twisted, and at this moment it looks like a viable option for handling the connections of multiple clients (20-30 for argument's sake). From what I've read, threading looks to be more of a hassle than it's worth.
Am I on the right track? Any suggestions? Or reading material worth looking at?
Thank you in advance
The Django web framework implements something called Signals, which at the ORM level lets you detect changes to specific database objects and attach handlers to those changes. Note that signals only fire for changes made through Django's ORM, not for writes made directly to the database. Since it is an open source project, you might want to check the source code to understand how Django does it while supporting multiple database backends (MySQL, PostgreSQL, Oracle, SQLite).
If you want to listen to database changes directly, most databases have some kind of system that logs every transactional change. For example, MySQL has the binary log, which you can keep reading from to detect changes to the database.
Also, while Twisted is great, I would recommend you use Gevent/Greenlet + this. If you want to integrate with Django, the question "how to combine django plus gevent the basics?" would help.
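On concern A from the question, a plain polling loop is often enough at this scale. The sketch below uses SQLite and a hypothetical tasks table with a run flag, and it assumes a single polling server (it is not safe against several servers claiming the same row). The dispatch callable stands in for handing the task to the first available client.

```python
import sqlite3
import time


def claim_next_task(conn):
    """Pick one task flagged run=1 and clear the flag; return its id or None."""
    with conn:  # commit the flag update, or roll back on error
        row = conn.execute(
            "SELECT id FROM tasks WHERE run = 1 ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        conn.execute("UPDATE tasks SET run = 0 WHERE id = ?", (row[0],))
        return row[0]


def poll(conn, dispatch, interval=1.0, max_polls=None):
    """Check the table repeatedly and hand claimed tasks to dispatch()."""
    polls = 0
    while max_polls is None or polls < max_polls:
        task_id = claim_next_task(conn)
        if task_id is not None:
            dispatch(task_id)
        else:
            time.sleep(interval)  # nothing to do; wait before asking again
        polls += 1


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tasks (id INTEGER PRIMARY KEY, run INTEGER NOT NULL)")
    conn.executemany("INSERT INTO tasks VALUES (?, ?)", [(1, 0), (2, 1)])
    conn.commit()
    poll(conn, lambda task_id: print("dispatching", task_id), interval=0, max_polls=3)
```

A small query like this once per second is usually a negligible load, especially with an index on the flag column; reading the database's change log, as suggested above, avoids polling altogether at the cost of more setup.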