I have a total of 25-30 scripts that update data on the server. New data gets generated every day, so every day I have to run these scripts manually in sequence to update the database. Each of these scripts contains 300+ lines of SQL queries.
Now I want to automate this task with Python, but I'm not really sure how to do that.
In the past I have used libraries to connect to SQL Server and then defined a cursor to execute certain queries, e.g. cur.execute("select * from abc").
But now I want to automate this task and run all the scripts by passing just the script names.
Like
cur.execute(sql_script_1)
cur.execute(sql_script_2)
cur.execute(sql_script_3)
.
.
.
cur.execute(sql_script_25)
That way, in the end, I'll just have to run this one .py file and it will automatically run all the scripts in the given order.
Can this be done somehow, either in this way or some other way?
The main goal is to automate the task of running all the scripts by just passing the names.
Your question is kinda vague, with no research or implementation effort shown, but I had a similar use case, so I will share what I did.
In my case I needed to pull down data as a pandas DataFrame. Your case might vary, but I imagine the basic structure will remain the same.
In any case, here is what I did:
1. You store each of the SQL scripts as a string variable in a Python file (or files) somewhere in your project directory.
2. You define a function that manages the connection and accepts a SQL script as an argument.
3. You call that function as needed for each script.
So your Python file in step one looks something like:
sql_script_1 = 'select....from...where...'
sql_script_2 = 'select....from...where...'
sql_script_3 = 'select....from...where...'
Then you define the function in step two to manage the connection. Something like:
import pandas as pd
import pyodbc
def query_passer(query: str, conn_db: str) -> pd.DataFrame:
    # Open the connection, run the query into a DataFrame, then close the connection
    conn = pyodbc.connect(conn_db)
    df = pd.read_sql(query, conn)
    conn.close()
    return df
Then in step three you call the function, do whatever you were going to do with the data, and repeat for each query.
The code below assumes the query_passer function was saved in a Python file named "functions.py" in the subfolder "resources", and that the queries are stored in a file named "query1.py" in the subfolder "resources/queries". Organize your own files as you wish, or just keep them all in one big file.
from resources.functions import query_passer
from resources.queries.query1 import sql_script_1, sql_script_2, sql_script_3
import pandas as pd
# define the connection
conn_db = (
    "Driver={stuff};"
    "Server=stuff;"
    "Database=stuff;"
    ".....;"
)
# run queries
df = query_passer(sql_script_1, conn_db)
df.to_csv('sql_script_1.csv')
df = query_passer(sql_script_2, conn_db)
df.to_csv('sql_script_2.csv')
df = query_passer(sql_script_3, conn_db)
df.to_csv('sql_script_3.csv')
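Note that pd.read_sql only really makes sense for queries that return rows. Since the original question is about scripts that update data, a variant that simply executes a .sql file and commits might fit better. This is only a minimal sketch under some assumptions: pyodbc is used, the file names are placeholders, and each script can be naively split into statements on semicolons (which breaks if your scripts contain semicolons inside string literals or use GO batch separators).
import pyodbc

def run_sql_file(path: str, conn_db: str) -> None:
    # Read the whole script from disk and execute it statement by statement
    with open(path, "r") as f:
        script = f.read()
    conn = pyodbc.connect(conn_db)
    try:
        cur = conn.cursor()
        for statement in script.split(";"):
            if statement.strip():
                cur.execute(statement)
        conn.commit()
    finally:
        conn.close()

# Placeholder file names -- run them in the required order
for name in ["sql_script_1.sql", "sql_script_2.sql", "sql_script_25.sql"]:
    run_sql_file(name, conn_db)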
Related
I have a few Python files that I want to execute sequentially in order to get an output, and I would now like to automate this process. So I would like to have a parent script from which I can execute all my child scripts in the right order. Also, I would like to execute one of the files twice, but with two different date variables, and store the outputs in two different folders. How do I create such a parent script in Python?
For example, I want to execute file1.py first with the date (a variable in file1.py) set to date1 and the output stored in dir1, and then execute file1.py again, but this time with the date set to date2 and the output in dir2. And then I want to execute file2.py.
How would I do this?
You can easily run Python scripts from another Python script using subprocesses. Something like:
import subprocess

subprocess.Popen(["python", "script2.py", "some_argument"])
The problem with using subprocesses is that it's quite annoying to get results back from them (you can print results from the child script and read them in the parent, but still).
A better solution is to save intermediate results in some database (like a simple SQLite file): you use your main script to initiate the child scripts, but they get their arguments from the database and write their results to the database too. It's quite easy and could solve your problems (https://docs.python.org/3/library/sqlite3.html).
For example, to save some variable in the SQLite database, all you need is:
import sqlite3

conn = sqlite3.connect("mydatabase.db")
cursor = conn.cursor()

# Create table (no need if it was created before)
cursor.execute("""CREATE TABLE example_table
                  (variable_name, variable_value)""")
# Save changes
conn.commit()

# Insert data
cursor.execute("""INSERT INTO example_table
                  VALUES ('some_variable', 'some_value')""")
# Save changes
conn.commit()
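Reading a value back in a child script is just as short. A sketch, using the same hypothetical table and database file as above:
import sqlite3

conn = sqlite3.connect("mydatabase.db")
cursor = conn.cursor()
# Look up the value stored by the parent script
cursor.execute("SELECT variable_value FROM example_table WHERE variable_name = ?",
               ("some_variable",))
row = cursor.fetchone()
value = row[0] if row else None
conn.close()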
I'm trying to populate a SQLite database using Django with data from a file that consists of 6 million records. However, the code that I've written takes far too long, even with just 50,000 records.
This is the code with which I'm trying to populate the database:
import os

def populate():
    with open("filename") as f:
        for line in f:
            col = line.strip().split("|")
            duns = col[1]
            name = col[8]
            job = col[12]
            dun_add = add_c_duns(duns)
            add_contact(c_duns=dun_add, fn=name, job=job)

def add_contact(c_duns, fn, job):
    c = Contact.objects.get_or_create(duns=c_duns, fullName=fn, title=job)
    return c

def add_c_duns(duns):
    cd = Contact_DUNS.objects.get_or_create(duns=duns)[0]
    return cd

if __name__ == '__main__':
    print "Populating Contact db...."
    os.environ.setdefault("DJANGO_SETTINGS_MODULE", "settings")
    from web.models import Contact, Contact_DUNS
    populate()
    print "Done!!"
The code works fine; I have tested it with dummy records and it gives the desired results. I would like to know if there is a way to lower the execution time of this code. Thanks.
I don't have enough reputation to comment, but here's a speculative answer.
Basically the only way to do this through Django's ORM is to use bulk_create. So the first thing to consider is your use of get_or_create: if your database has existing records that might be duplicated in the input file, then your only choice is writing the SQL yourself; if you only use it to avoid duplicates within the input file, then preprocess the file to remove duplicate rows instead.
So if you can live without the get part of get_or_create, you can follow this strategy:
1. Go through each row of the input file and instantiate a Contact_DUNS instance for each entry (don't actually create the rows, just write Contact_DUNS(duns=duns)) and save all instances to a list. Pass the list to bulk_create to actually create the rows.
2. Generate a list of DUNS-id pairs with values_list and convert them to a dict with the DUNS number as the key and the row id as the value.
3. Repeat step 1, but with Contact instances. Before creating each instance, use the DUNS number to get the Contact_DUNS id from the dictionary of step 2, then instantiate each Contact as Contact(duns_id=c_duns_id, fullName=fn, title=job). Again, after collecting the Contact instances, just pass them to bulk_create to create the rows.
This should radically improve performance, as you will no longer be executing a query for each input line. But as I said above, this can only work if you can be certain that there are no duplicates in the database or in the input file.
EDIT Here's the code:
import os

def populate_duns():
    # Will only work if there are no DUNS duplicates
    # (both in the DB and within the file)
    duns_instances = []
    with open("filename") as f:
        for line in f:
            duns = line.strip().split("|")[1]
            duns_instances.append(Contact_DUNS(duns=duns))
    # Run a single INSERT query for all DUNS instances
    # (actually it will be run in batches, but it's still quite fast)
    Contact_DUNS.objects.bulk_create(duns_instances)

def get_duns_dict():
    # This is basically a SELECT query for these two fields
    duns_id_pairs = Contact_DUNS.objects.values_list('duns', 'id')
    return dict(duns_id_pairs)

def populate_contacts():
    # Repeat the same process for Contacts
    contact_instances = []
    duns_dict = get_duns_dict()
    with open("filename") as f:
        for line in f:
            col = line.strip().split("|")
            duns = col[1]
            name = col[8]
            job = col[12]
            ci = Contact(duns_id=duns_dict[duns],
                         fullName=name,
                         title=job)
            contact_instances.append(ci)
    # Again, run only a single INSERT query
    Contact.objects.bulk_create(contact_instances)

if __name__ == '__main__':
    print "Populating Contact db...."
    os.environ.setdefault("DJANGO_SETTINGS_MODULE", "settings")
    from web.models import Contact, Contact_DUNS
    populate_duns()
    populate_contacts()
    print "Done!!"
CSV Import
First of all, 6 million records is quite a lot for SQLite, and worse still, SQLite isn't very good at importing CSV data directly.
There is no standard as to what a CSV file should look like, and the SQLite shell does not even attempt to handle all the intricacies of interpreting a CSV file. If you need to import a complex CSV file and the SQLite shell doesn't handle it, you may want to try a different front end, such as SQLite Database Browser.
On the other hand, MySQL and PostgreSQL are more capable of handling CSV data, and MySQL's LOAD DATA INFILE and PostgreSQL's COPY are both painless ways to import very large amounts of data in a very short period of time.
Suitability of SQLite
You are using Django => you are building a web app => more than one user will access the database. This is from the manual about concurrency:
SQLite supports an unlimited number of simultaneous readers, but it will only allow one writer at any instant in time. For many situations, this is not a problem. Writers queue up. Each application does its database work quickly and moves on, and no lock lasts for more than a few dozen milliseconds. But there are some applications that require more concurrency, and those applications may need to seek a different solution.
Even your read operations are likely to be rather slow, because an SQLite database is just one single file, so with this amount of data there will be a lot of seek operations involved. The data cannot be spread across multiple files or even disks as is possible with proper client-server databases.
The good news for you is that with Django you can usually switch from SQLite to MySQL to PostgreSQL just by changing your settings.py. No other changes are needed. (The reverse isn't always true.)
So I urge you to consider switching to MySQL or PostgreSQL before you get in too deep. It will help you solve your present problem and also help to avoid problems that you will run into sooner or later.
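For reference, the settings.py change mentioned above is usually just the DATABASES block. A minimal sketch, assuming PostgreSQL with psycopg2 installed and placeholder names and credentials:
# settings.py -- placeholder names and credentials, adjust to your environment
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.postgresql_psycopg2',
        'NAME': 'mydatabase',
        'USER': 'myuser',
        'PASSWORD': 'mypassword',
        'HOST': 'localhost',
        'PORT': '5432',
    }
}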
6,000,000 records is quite a lot to import via Python. If Python is not a hard requirement, you could write a SQLite script that directly imports the CSV data and creates your tables using SQL statements. Even faster would be to preprocess your file using awk and output two CSV files corresponding to your two tables.
I once imported 20,000,000 records using the sqlite3 CSV importer and it took only a few minutes.
Right now, I have a Django application with an import feature which accepts a .zip file, reads out the CSV files, formats them to JSON, and then inserts them into the database. The JSON file with all the data is put into temp_dir and is called data.json.
Unfortunately, the insertion is done like so:
Building.objects.all().delete()
call_command('loaddata', os.path.join(temp_dir, 'data.json'))
My problem is that all the data is deleted and then re-added. I need to instead find a way to update and add data, not delete it.
I've been looking at other Django commands, but I can't seem to find one that would allow me to insert data and update or add records. I'm hoping there is an easy way to do this without modifying a whole lot.
If you loop through your data you could use get_or_create(); this will return the object if it exists and create it if it doesn't:
obj, created = Person.objects.get_or_create(first_name='John', last_name='Lennon', defaults={'birthday': date(1940, 10, 9)})
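Applied to the import in the question, that might look roughly like the sketch below. It assumes data.json is in Django's fixture format (a list of objects with 'pk' and 'fields' keys, which is what loaddata consumes) and reuses the Building model and temp_dir from the question; adjust the lookup to your actual schema.
import json
import os

with open(os.path.join(temp_dir, 'data.json')) as f:
    records = json.load(f)

for rec in records:
    # Fixtures store the primary key in 'pk' and the column values in 'fields';
    # existing rows are left alone, missing rows are created
    obj, created = Building.objects.get_or_create(
        pk=rec['pk'],
        defaults=rec['fields'],
    )
If existing rows should also be refreshed rather than left untouched, update_or_create() takes the same arguments and updates the defaults fields on a match.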
I have a script built using procedural programming that uses a sqlite database file. The script processes a CSV file, then uses a standard cursor to pass its particulars to a single SQLite DB.
After this, the script extracts from the DB to produce a number of spreadsheets in Excel via xlwt.
The problem with this is that the script only handles one input file at a time, whereas I will need to be iterating through about 70-90 of these files on any given day.
I've been trying to rewrite the script as object-oriented, but I'm having trouble with sharing the cursor.
The original input file comes in a zip archive that I extract via the Linux or Mac OS X command line. Previously this was done manually; now I've managed to write classes and loop through totally ad-hoc numbers of multiple input files via the multiple-selection version of tkFileDialog.
Furthermore, the original input file (i.e. one of the 70-90) is a text file in CSV format with a DRF extension (obviously not really important) that gets picked by a simple tkFileDialog box:
FILENAMER = tkFileDialog.askopenfilename(
    title="Open file",
    filetypes=[("txt file", ".DRF"), ("txt file", ".txt"), ("All files", ".*")])
The DRF file itself is issued daily by location and date, ie 'BHP0123.DRF' is for the BHP location, issued for 23 January. To keep everything as straightforward as possible the procedural script further decomposes the DRF to just the BHP0123 part or prefix, then uses it to build a SQLite DB.
FBASENAME = os.path.basename(FILENAMER)
FBROOT = os.path.splitext(FBASENAME)[0]
OUTPUTDATABASE = 'sqlite_' + FBROOT + '.db'
Basically with the program as a procedural script I just had to create one DB, one connection and one cursor, which could be shared by all the functions in the script:
conn = sqlite3.connect(OUTPUTDATABASE) # <-- originally :: Is this the core problem?
curs = conn.cursor()
conn.text_factory = sqlite3.OptimizedUnicode
In the procedural version these variables above are global.
Procedurally I have
1) one function to handle formatting, and
2) another to handle the calculations needed. The DRF is indexed with about 2500 fields per row; I discard the majority and only use about 400-500 of these per row.
The formatting function parses out the CSV via a for-loop (discards junk characters, incomplete data, etc), then passes the formatted data for the calculator to process and chew on. The core problem seems to be that on the one hand I need the DB connection to be constant for each input DRF file, but on the other that connection can only be shared by the formatter and calculator, and 'regenerated' for each DRF.
Crucially, I've tried to rewrite as little of the formatter and calculator as possible.
I've tried to create a separate dbcxn class, then create an instance to share, but I'm confused as to how to handle the output DB situation with the cursor (and pass it intact to both formatter and calculator):
class DBcxn(object):

    def __init__(self, OUTPUTDATABASE):
        OUTPUTDATABASE = ?????
        self.OUTPUTDATABASE = OUTPUTDATABASE

    def init_db_cxn(self, OUTPUTDATABASE):
        conn = sqlite3.connect(OUTPUTDATABASE)  # < ????
        self.conn = conn
        curs = conn.cursor()
        self.curs = curs
        conn.text_factory = sqlite3.OptimizedUnicode

dbtest = DBcxn( ???? )
If anyone might suggest a way of untangling this I'd be very grateful. Please let me know if you need more information.
Cheers
Massimo Savino
I'm using psycopg2 to copy CSVs into Postgres. All my CSVs will always have the same base name, such as competitors or customers, with a unique number appended each time I generate them. Therefore I want a Python script that can run COPY where I just pass 'dir/customers*.csv'.
Is there an easy way to do this? Does the asterisk work as a wildcard in some other form in Python?
Thanks in advance!
OK, so what I currently have is a geoprocessing model in ArcGIS that generates a bunch of CSVs into C:\Data\Sheltered BLPUs\CSVs, such as competitors and customers, but attaches a unique ID for each location onto the end of the filename, e.g. customers17.csv. I then want a Python script to copy all of these into Postgres. All the tables already exist as 'customers' or 'competitors'; this is what I wanted to write:
cur.execute("COPY competitors FROM 'C:/Data/Sheltered BLPUs/CSVs/competitors*.csv' DELIMITER ',' CSV;")
import glob
print glob.glob('dir/customers*.csv')
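Putting glob and psycopg2 together, the loop might look roughly like the sketch below. The connection string is a placeholder, and it assumes each CSV's columns line up with the target table; copy_expert streams the file from the client side, so the Postgres server does not need to be able to read the Windows path itself.
import glob
import psycopg2

conn = psycopg2.connect("dbname=mydb user=myuser")  # placeholder connection details
cur = conn.cursor()

for path in glob.glob('C:/Data/Sheltered BLPUs/CSVs/customers*.csv'):
    with open(path) as f:
        # Stream the file into the existing 'customers' table
        cur.copy_expert("COPY customers FROM STDIN WITH CSV", f)

conn.commit()
cur.close()
conn.close()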