Data Integration Structure using Python and/or SSIS

I have a question on the general strategy of how to integrate data into an MSSQL database.
Currently, I use python for my whole ETL process. I use it to clean, transform, and integrate the data in an MSSQL database. My data is small so I think this process works fine for now.
However, I think it is a little awkward for my code to constantly read data from and write data to the database. I suspect this strategy will become an issue once I'm dealing with large amounts of data, since the constant read/write seems very inefficient. However, I don't know enough to tell whether this is a real problem or not.
I want to know whether this is a feasible approach or whether I should switch entirely to SSIS to handle it. SSIS feels clunky to me and I'd prefer not to rewrite my entire codebase. Any input on the general ETL architecture would be very helpful.

Is this practice alright? Maybe?
There are too many factors to give a definitive answer. Conceptually, what you're doing - extract data from a source, transform it, load it to a destination (ETL) - is all that SSIS does. It can likely do some things more efficiently than Python - at least, I've had a devil of a time getting a bulk load to work with memory-mapped data in Python, whereas dumping to disk and bulk inserting that file via Python is no problem. But if the existing process works, leave it alone until it doesn't.
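If it helps, the "dump to disk and bulk insert via Python" path can look something like this rough sketch (server, database, table, and file paths are placeholders; note that BULK INSERT reads the file path as seen by the SQL Server machine, not your client):

import csv
import pyodbc

rows = [(1, "alpha"), (2, "beta")]                       # transformed data (illustrative)
with open(r"C:\staging\batch.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)                        # dump to disk first

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=your_server;DATABASE=your_db;Trusted_Connection=yes;",
    autocommit=True,
)
conn.cursor().execute(
    "BULK INSERT dbo.target FROM 'C:\\staging\\batch.csv' "
    "WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\\n')"
)
conn.close()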
If your team knows Python, introducing SSIS just to do ETL is likely going to be a bigger maintenance cost than scaling up your existing approach. On the other hand, if it's standard-ish Python plus libraries and you're on SQL Server 2017+, you might be able to execute your scripts from within the database itself via sp_execute_external_script.
If the ETL process runs on the same box as the database, then ensure you have sufficient resources to support both processes at their maximum observed levels of activity. If the ETL runs elsewhere, then you'll want to ensure you have fast, full duplex connectivity between the database server and the processing box.
Stand up a load testing environment that parallels production's resources. Dummy up a 10x increase in source data and observe how the ETL fares. 100x, 1000x. At some point, you'll identify what development sins you committed that do not scale and then you're poised to ask a really good, detailed question describing the current architecture, the specific code that does not perform well under load and how one can reproduce this load.
The above design considerations will hold true for Python, SSIS or any other ETL solution - prepackaged or bespoke.

Related

What is the best way to schedule a view in a database (ClickHouse) into a table, so that the table's data is updated at a specified time?

Currently, the approach I take is:
clearing the rows in the table using Python,
fetching the output of the view in Python and storing the result in a DataFrame,
appending the data to the table using df.to_sql in Python,
scheduling this script to run every day at a specified time (Prefect).
I find this method unappealing for the following reasons:
This method is external to the database, hence it involves latency.
This method is subject to various dependencies, like the SQL connector I am using for Python and the scheduler (Prefect), where debugging can get tricky if I have more than 10 tables.
Is there a better way/package/tool to automate the process with the least dependencies and latency?
Have you tried Prefect 2 already? Regarding the load process, you may consider loading data to a temp table and merging from there -- by doing that in SQL, it might be faster and easier to troubleshoot. dbt is also a tool you can consider, and you can orchestrate dbt with prefect using the prefect-dbt package: https://github.com/PrefectHQ/prefect-dbt
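As a rough sketch of that staging-table idea with pandas and SQLAlchemy (the connection URL, table names, and the final INSERT ... SELECT are placeholders, and the exact SQL will depend on ClickHouse vs. another engine):

import pandas as pd
from sqlalchemy import create_engine, text

engine = create_engine("clickhouse+native://user:pass@host/db")    # placeholder URL/dialect

def refresh_table(view_sql, target, staging="stage_tmp"):
    df = pd.read_sql(view_sql, engine)                              # output of the view
    df.to_sql(staging, engine, if_exists="replace", index=False)    # load into a staging table
    with engine.begin() as conn:                                    # the merge/swap happens in SQL
        conn.execute(text(f"TRUNCATE TABLE {target}"))
        conn.execute(text(f"INSERT INTO {target} SELECT * FROM {staging}"))

Keeping the merge step in SQL means a failed Python run leaves the target table untouched, which tends to be easier to troubleshoot.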

Automate a manual task using Python

I have a question and hope someone can direct me in the right direction. Basically, every week I have to run a query (SSMS) to get a table containing some information (date, clientnumber, clientID, orderid, etc.), and then I copy all the information in that table and paste it into a folder as a CSV file. It takes me about 15 minutes to do all this, but I keep thinking: can I automate this? If yes, how can I do that, and can I also schedule it so it runs by itself every week? I believe we live in a technological era and this should be done without human input, so I hope I can find someone here willing to show me how to do it using Python.
Many thanks for considering my request.
This should be pretty simple to automate:
Use a database adapter that can work with your database; for MSSQL, the one provided by pyodbc will be fine,
Within the script, connect to the database, perform the query, and parse the output,
Save the parsed output to a .csv file (you can use Python's built-in csv module),
Run the script as a periodic task using cron/schtasks if you work on Linux/Windows, respectively. A minimal sketch of the first three steps is shown below.
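Something along these lines (driver, server, database, query, and output path are all placeholders):

import csv
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=your_server;DATABASE=your_db;Trusted_Connection=yes;"
)
cursor = conn.cursor()
cursor.execute("SELECT date, clientnumber, clientID, orderid FROM dbo.orders")   # your weekly query

with open(r"C:\exports\weekly_report.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow([col[0] for col in cursor.description])   # header row from the cursor metadata
    writer.writerows(cursor.fetchall())                       # one CSV row per result row

conn.close()

Scheduled via schtasks (or cron), this runs unattended every week.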
Please note that your question is too broad, and shows no research effort.
You will find that Python can do the tasks you desire.
There are many different ways to interact with SQL servers, depending on your implementation. I suggest you learn Python + SQL using the built-in sqlite3 library. You will want to save your query as a string and pass it into an SQL connection manager of your choice; this depends on your server setup, and there are many different SQL packages for Python.
You can use pandas for parsing the data and saving it to a .csv file (the method is literally called to_csv).
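As a sketch of that approach (using the built-in sqlite3 module for practice; the query, table, and file names are made up, and you would swap the connection for your real server's driver later):

import sqlite3
import pandas as pd

conn = sqlite3.connect("practice.db")            # local practice database
query = "SELECT * FROM orders"                   # your saved query string
df = pd.read_sql_query(query, conn)              # run the query into a DataFrame
df.to_csv("weekly_report.csv", index=False)      # literally to_csv
conn.close()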
Python does have many libraries for scheduling tasks, but I suggest you hold off for a while. Develop your code so that it can be run manually, which will still be much faster/easier than doing the task without Python. Once you know your code works, you can easily implement a scheduler. The downside is that your program will always need to be running, and you will need to keep checking to see that it is running. Personally, I would keep it restricted to manually running the script; you could compile it to an .exe and bind it to a hotkey if you need the accessibility.

Python Multithreading/processing gains for inserts to different tables in MySQL?

I've been poring over everything I can find to get an answer to this, but can't seem to find anything:
I've got a batch update to a MySQL database that happens every few minutes, with Python handling the ETL work (I'm pulling data from web APIs into the MySQL system).
I'm trying to get a sense of what kinds of potential impact (be it positive or negative) I'd see by using either multithreading or multiprocessing to do multiple connections & inserts of the data simultaneously. Each worker (be it thread or process) would be updating a different table from any other worker.
At the moment I'm only updating a half-dozen tables with a few thousand records each, but this needs to be scalable to dozens of tables and hundreds of thousands of records each.
Every other resource I can find out there addresses doing multithreading/processing to the same table, not a distinct table per worker. I get the impression I would definitely want to use multithreading/processing, but it seems everyone's addressing the one-table use case.
Thoughts?
I think your question is too broad to answer concisely. It seems you're asking about two separate subjects - will writing to separate MySQL tables speed it up, and is python multithreading the way to go. For the python part, since you're probably doing mostly IO, you should look at gevent, and ultramysql. As for the MySQL part, you'll have to wait for more answers.
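Not gevent/ultramysql specifically, but as a plain-stdlib illustration of the "one worker and one connection per table" idea (host, credentials, table names, and columns are all made up):

from concurrent.futures import ThreadPoolExecutor
import pymysql

def load_table(table, rows):
    # one connection per worker; DB-API connections should not be shared across threads
    conn = pymysql.connect(host="localhost", user="etl",
                           password="secret", database="warehouse")
    try:
        with conn.cursor() as cur:
            cur.executemany(
                f"INSERT INTO {table} (id, payload) VALUES (%s, %s)", rows)
        conn.commit()
    finally:
        conn.close()

batches = {
    "table_a": [(1, "alpha"), (2, "beta")],
    "table_b": [(1, "gamma"), (2, "delta")],
}
with ThreadPoolExecutor(max_workers=len(batches)) as pool:
    futures = [pool.submit(load_table, t, r) for t, r in batches.items()]
    for f in futures:
        f.result()                               # re-raise any worker exception

Because the work is IO-bound, threads (or gevent greenlets) avoid the overhead of full processes while still overlapping the network round-trips.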
For an ETL process I wrote in C#, I decided the best work partitioning was each "source" having a thread for extraction, one thread for each transform "type", and one to load the transformed data to each target.
In my case, I found multiple threads per source just ended up saturating the source server too much; it became less responsive overall (to even non-ETL queries) and the extractions didn't really finish any faster since they ended up competing with each other on the source. Since retrieving the remote extract was more time consuming than the local (in memory) transform, I was able to pipeline the extract results from all sources through one transformer thread/queue (per transform "type"). Similarly, I only had a single target to load the data to, so having multiple threads there would have just monopolized the target.
(Some details omitted/simplified for brevity, and due to poor memory.)
...but I'd think we'd need more details about what your ETL process does.

Spark and Cassandra through Python

I have huge data stored in Cassandra and I want to process it with Spark through Python.
I just want to know how to interconnect Spark and Cassandra through Python.
I have seen people using sc.cassandraTable, but it isn't working for me, and fetching all the data at once from Cassandra and then feeding it to Spark doesn't make sense.
Any suggestions?
Have you tried the examples in the documentation?
Spark Cassandra Connector Python Documentation
# assumes an active SparkSession named `spark` with the Spark Cassandra Connector available
spark.read\
    .format("org.apache.spark.sql.cassandra")\
    .options(table="kv", keyspace="test")\
    .load().show()
I'll just give my "short" 2 cents. The official docs are totally fine for getting started. You might want to specify why this isn't working, i.e. did you run out of memory (perhaps you just need to increase the driver memory), or is there some specific error causing your example not to work? It would also be nice if you provided that example.
Here are some of the opinions/experiences I've had. Usually, not always, but most of the time, you have multiple columns in a partition. You don't always have to load all the data in a table; more often than not you can keep the processing within a single partition. Since the data is sorted within a partition, this usually goes pretty fast and doesn't present any significant problem.
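For instance, instead of loading the whole table, you can filter on the partition key and the connector will push that predicate down to Cassandra, so only the relevant partition is fetched (here I'm assuming "k" is the partition key of the "kv" table from the example above):

# read a single partition rather than the full table; the equality predicate on the
# partition key is pushed down to Cassandra by the connector
df = (spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(table="kv", keyspace="test")
      .load()
      .filter("k = 'some_partition_key'"))
df.show()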
If you don't want the whole "store in Cassandra, fetch to Spark" cycle for your processing, there are really a lot of solutions out there. Basically that would be Quora material. Here are some of the more common ones:
Do the processing in your application right away - this might require some sort of inter-instance communication framework like Hazelcast, or even better Akka Cluster; this is really a wide topic.
Spark Streaming - do your processing right away in micro-batches and flush the results for reading to some persistence layer - which might be Cassandra.
Apache Flink - use a proper streaming solution and periodically flush the state of the process to, e.g., Cassandra.
Store the data in Cassandra the way it's supposed to be read - this approach is the most advisable (just hard to say with the info you provided).
The list could go on and on... user-defined functions in Cassandra, or aggregate functions if your task is something simpler.
It might also be a good idea to provide some details about your use case. More or less, what I said here is pretty general and vague, but then again, putting all of this into a comment just wouldn't make sense.

Quickly dumping a database in memory to file

I want to take advantage of the speed benefits of holding an SQLite database (via SQLAlchemy) in memory while I go through a one-time process of inserting content, and then dump it to file, stored to be used later.
Consider a bog-standard database created in the usual way:
# in-memory database
from sqlalchemy import create_engine
e = create_engine('sqlite://')
Is there a quicker way of moving its contents to disc, other than just creating a brand new database and inserting each entry manually?
EDIT:
There is some doubt as to whether or not I'd even see any benefits to using an in-memory database. Unfortunately I already see a huge time difference of about 120x.
This confusion is probably due to me missing out some important detail in the question. Also probably due to a lack of understanding on my part re: caches / page sizes / etc. Allow me to elaborate:
I am running simulations of a system I have set up, with each simulation going through the following stages:
Make some queries to the database.
Make calculations / run a simulation based on the results of those queries.
Insert new entries into the database based on the most recent simulation.
Make sure the database is up to date with the new entries by running commit().
While I only ever make a dozen or so insertions on each simulation run, I do however run millions of simulations, and the results of each simulation need to be available for future simulations to take place. As I say, this read and write process takes considerably longer when running a file-backed database; it's the difference between 6 hours and a month.
Hopefully this clarifies things. I can cobble together a simple Python script to outline my process a little further if necessary.
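Roughly, each iteration looks something like this (a minimal sketch; the table, columns, and "simulation" are made up):

from sqlalchemy import create_engine, MetaData, Table, Column, Integer, Float

engine = create_engine('sqlite://')               # in-memory database
metadata = MetaData()
results = Table('results', metadata,
                Column('id', Integer, primary_key=True),
                Column('value', Float))
metadata.create_all(engine)

for step in range(1000):                          # millions of runs in practice
    with engine.begin() as conn:                  # commits on exit (step 4)
        prior = conn.execute(results.select()).fetchall()                 # step 1: query
        new_rows = [{'value': float(len(prior) + i)} for i in range(12)]  # steps 2-3: "simulate"
        conn.execute(results.insert(), new_rows)                          # a dozen inserts per run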
SQLAlchemy and SQLite know how to cache and do batch-inserts just fine.
There is no benefit in using an in-memory SQLite database here, because that database uses pages just like the on-disk version would, and the only difference is that eventually those pages get written to disk for a disk-based database. The difference in performance is only about 1.5 times; see SQLite Performance Benchmark -- why is :memory: so slow...only 1.5X as fast as disk?
There is also no way to move the in-memory database to a disk-based database at a later time, short of running queries on the in-memory database and executing batch inserts into the disk-based database on two separate connections.
