ETL using Python


I am working on a data warehouse and looking for an ETL solution that uses Python.
I have played with SnapLogic as an ETL, but I was wondering if there were any other solutions out there.
This data warehouse is just getting started. I have not brought any data over yet. The initial subset of data I want to load into it will easily be over 100 GB.

Yes. Just write Python using a DB-API interface to your database.
Most ETL programs provide fancy "high-level languages" or drag-and-drop GUIs that don't help much.
Python is just as expressive and just as easy to work with.
Eschew obfuscation. Just use plain-old Python.
We do it every day and we're very, very pleased with the results. It's simple, clear and effective.
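
A minimal sketch of what such a plain-Python ETL job can look like. It uses the standard-library sqlite3 module only because it is a DB-API driver that runs anywhere; the orders and fact_orders tables, their columns, and the batch size are hypothetical and not from the answer above. With another DB-API driver the structure stays the same, though the parameter placeholder style (? here) may differ.

```python
import sqlite3

BATCH_SIZE = 1000  # illustrative value; tune to your data and database


def transform(row):
    """Clean one source row: tidy the customer name, convert cents to an amount."""
    order_id, customer, amount_cents = row
    return (order_id, customer.strip().title(), amount_cents / 100.0)


def run_etl(source, target, batch_size=BATCH_SIZE):
    """Copy rows from source.orders to target.fact_orders in batches."""
    read_cur = source.cursor()
    read_cur.execute("SELECT id, customer, amount_cents FROM orders")
    write_cur = target.cursor()
    loaded = 0
    while True:
        batch = read_cur.fetchmany(batch_size)  # keeps memory use bounded
        if not batch:
            break
        write_cur.executemany(
            "INSERT INTO fact_orders (order_id, customer, amount) VALUES (?, ?, ?)",
            [transform(r) for r in batch],
        )
        loaded += len(batch)
    target.commit()
    return loaded


def demo():
    """Build two in-memory databases and run the ETL between them."""
    source = sqlite3.connect(":memory:")
    source.execute("CREATE TABLE orders (id INTEGER, customer TEXT, amount_cents INTEGER)")
    source.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [(1, "  alice ", 1250), (2, "BOB", 300)],
    )
    target = sqlite3.connect(":memory:")
    target.execute("CREATE TABLE fact_orders (order_id INTEGER, customer TEXT, amount REAL)")
    count = run_etl(source, target, batch_size=1)
    rows = target.execute(
        "SELECT order_id, customer, amount FROM fact_orders ORDER BY order_id"
    ).fetchall()
    return count, rows
```

Reading with fetchmany() and writing with executemany() keeps the job from holding the full 100 GB in memory at once, which is the main design point when the source is large.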

You can use pyodbc, a third-party Python library, to extract data from various database sources. Then use pandas DataFrames to manipulate and clean the data as your organization needs, and finally use pyodbc again to load it into your data warehouse.

Related

Importing data from Excel to REDCap using API by Python

I am new to the Python world but have a reasonable understanding at a basic level. I would appreciate it if someone could share a guide on how to import data from Excel to REDCap using the API from Python. The data I have is medical, such as patient name, age, comorbidities, etc.

Approach this in two steps.
Use something like pandas.read_excel() to get the data under control of Python.
Then use the PyCap package's import_records() to write the records to the REDCap server.
(In future SO posts, include more details so the code can be more tailored. I know it's tricky when PHI is involved and a fake dataset must be used on SO.)
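
A minimal sketch of the two steps, assuming the spreadsheet's column headers already match the REDCap field names. The function names, file name, API URL, and token are hypothetical placeholders; pandas.read_excel() and PyCap's Project.import_records() are the calls named above.

```python
def clean_records(rows):
    """Convert spreadsheet rows (dicts) to string values; blanks become ''."""
    cleaned = []
    for row in rows:
        cleaned.append({
            str(key).strip(): "" if value is None else str(value).strip()
            for key, value in row.items()
        })
    return cleaned


def excel_to_redcap(xlsx_path, api_url, api_token):
    """Read an Excel sheet and import its rows into a REDCap project."""
    # Third-party packages (pip install pandas openpyxl PyCap), imported here
    # so clean_records() can be used and tested without them.
    import pandas as pd
    from redcap import Project

    # dtype=str stops numeric IDs from turning into floats;
    # fillna("") replaces empty cells (NaN) with empty strings.
    frame = pd.read_excel(xlsx_path, dtype=str).fillna("")
    records = clean_records(frame.to_dict(orient="records"))

    project = Project(api_url, api_token)
    return project.import_records(records)
```

It would be called as excel_to_redcap("patients.xlsx", "https://redcap.example.org/api/", "YOUR_API_TOKEN"). Trying it first against a test project with a fake dataset is prudent when PHI is involved.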

Swamped with real time data and tasked with building a database

I work for a power company, and have been tasked with building a database. I have a beginner-to-intermediate understanding of Python, and can fuddle decently with MSSQL. They have procured Azure for this project, and I am completely lost as to how to start this task.
Here is one of the sources of data that I want to scrape every minute.
http://ets.aeso.ca/ets_web/docroot/tradingPage.html - this is a complete overview of the Alberta power market in real time.
Ideally, I would want to be able to scrape this data and other sources, modify it to fit a certain format, and push it onto the SQL Server.
Do I need virtual machines that are just looping over Python scripts? Or do I need managed instances? This data also needs to be queryable right after it is scraped. Eventually this data may feed machine learning algorithms (I don't know jack about that either, but I have been told it should play friendly with that type of environment).
Just looking to see if anyone has any insight in how you would approach this, and can tell me what I clearly don't know and haven't thought of. Any insight is truly appreciated.
Thanks!

How to improve the write speed to sql database using python

I'm trying to find a better way to push data to a SQL database using Python. I have tried the dataframe.to_sql() method and the cursor.fast_executemany setting, but they don't seem to increase the speed with the data I'm working with right now (the data is in CSV files). Someone suggested that I could use named tuples and generators to load data much faster than pandas can.
[Generally the CSV files are at least 1 GB in size and it takes around 10-17 minutes to push one file]
I'm fairly new to many Python concepts, so please suggest a method or at least point me to an article with more information. Thanks in advance.

If you are trying to insert the CSV as is into the database (i.e. without doing any processing in pandas), you could use SQLAlchemy in Python to execute a "BULK INSERT [params, file, etc.]" statement. Alternatively, I've found that reading the CSVs, processing them, writing the result back to CSV, and then bulk inserting can be an option.
Otherwise, feel free to specify a bit more what you want to accomplish, how you need to process the data before inserting to the db, etc.
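
A minimal sketch of the BULK INSERT route, assuming SQL Server 2017 or later (for FORMAT = 'CSV') and a CSV file with one header row. The function names, table name, and connection URL are hypothetical. Note that the file path is resolved on the machine running SQL Server, not on the client running Python.

```python
import re


def build_bulk_insert(table, csv_path, first_row=2):
    """Return a T-SQL BULK INSERT statement for a CSV file with a header row."""
    # Identifiers cannot be bound as parameters, so validate the table name.
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*)?", table):
        raise ValueError(f"unexpected table name: {table!r}")
    safe_path = csv_path.replace("'", "''")  # escape quotes inside the literal
    return (
        f"BULK INSERT {table} FROM '{safe_path}' "
        f"WITH (FORMAT = 'CSV', FIRSTROW = {int(first_row)}, TABLOCK)"
    )


def bulk_load(connection_url, table, csv_path):
    """Run the BULK INSERT through SQLAlchemy inside one transaction."""
    # Third-party: pip install sqlalchemy pyodbc
    from sqlalchemy import create_engine

    engine = create_engine(connection_url)
    with engine.begin() as conn:  # commits on success, rolls back on error
        # exec_driver_sql passes the statement to the driver unchanged
        conn.exec_driver_sql(build_bulk_insert(table, csv_path))
```

Because the server reads the file itself, no rows travel through Python, which is why this tends to beat row-by-row inserts for files in the gigabyte range.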

Automating IBM SPSS Data Collection survey export?

I'm so sorry for the vague question here, but I'm hoping an SPSS expert will be able to help me out here. We have some surveys that are done via SPSS, from which we extract data for an internal report. Right now the process is very cumbersome and requires going to the SPSS Data Collection Interviewer Server Administration page and manually exporting data from two different projects (which takes hours at a time!). We then take that data, massage it, and upload it to another database that drives the internal report.
My question is, does anyone out there know how to automate this process? Is there a SQL Server database behind the SPSS data? Where does the .mdd file come in to play? Can my team (who is well-versed in extracting data from various sources) tap into the SQL Server database behind SPSS to get our data? Or do we need some sort of Python script and plugin?
If I'm missing information that would be helpful in answering the question, please let me know. I'm happy to provide it; I just don't know what to provide.
Thanks so much.

As mentioned by other contributors, there are a few ways to achieve this. The simplest I can suggest is using a DMS (data management script) and Windows Task Scheduler. Ideally you should follow the steps below.
Prerequisites:
1. Access to the server running IBM Data Collection
2. Basic knowledge of Windows Task Scheduler
3. Knowledge of DMS scripting
Approach:
1. Create a new DMS script from the template.
2. If you only want to extract / transform data, you only need an input and an output data source.
3. In the input data source, build the connection string pointing to your survey on the IBM Data Collection server. Use SQL as the data source.
4. In the select query, use "Select * from VDATA" if you want to export all variables.
5. Set the output data connection string by selecting SPSS as the output data format (if you want to export to SPSS).
6. Run the script manually and check that the SPSS export is what you expect.
7. Create a batch file using a text editor (save it with a .bat extension). Add the lines below:
cd "C:\Program Files\IBM\SPSS\DataCollection\6\DDL\Scripts\Data Management\DMS"
Call DMSRun YOURDMSFILENAME.dms
Then add a line to copy (using XCOPY) the extracted data / files to the location where you want to process them further.
Save the file and open Windows Task Scheduler to schedule the execution of this batch file for data extraction.
If you want to do any further processing, create an mrs or dms file and add it to the batch file.
Hope this helps!
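
If your team would rather schedule a Python script than a .bat file, the batch steps can be sketched as below. This is only an illustration: the DMSRun command and the directory come from the batch file above, while run_export, output_dir, and dest_dir are hypothetical names, and it assumes DMSRun reports failure through a non-zero exit code.

```python
import shutil
import subprocess
from pathlib import Path

# Directory taken from the batch file above; adjust to your installation.
DMS_DIR = r"C:\Program Files\IBM\SPSS\DataCollection\6\DDL\Scripts\Data Management\DMS"


def run_export(dms_file, output_dir, dest_dir, dms_dir=DMS_DIR, runner=subprocess.run):
    """Run a DMS script with DMSRun, then copy the exported files to dest_dir."""
    # Mirror the batch file: run the command through the shell from dms_dir.
    # check=True raises CalledProcessError on a non-zero exit code.
    runner(f'DMSRun "{dms_file}"', cwd=dms_dir, shell=True, check=True)

    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    for item in sorted(Path(output_dir).iterdir()):
        if item.is_file():  # plays the role of the XCOPY line
            shutil.copy2(item, dest / item.name)
            copied.append(item.name)
    return copied
```

The script can then be scheduled in Windows Task Scheduler the same way as the batch file, and the returned list of copied file names is a convenient thing to log.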

There are a number of different ways you can accomplish easing this task and even automate it completely. However, if you are not an IBM SPSS Data Collection expert and don't have access to somebody who is or have the time to become one, I'd suggest getting in touch with some of the consultants who offer services on the platform. Internally IBM doesn't have many skilled SPSS resources available, so they rely heavily on external partners to do services on a lot of their products. This goes for IBM SPSS Data Collection in particular, but is also largely true for SPSS Statistics.
As noted by previous contributors there is an approach using Python for data cleaning, merging and other transformations and then loading that output into your report database. For maintenance reasons I'd probably not suggest this approach. Though you are most likely able to automate the export of data from SPSS Data Collection to a sav file with a simple SPSS Syntax (and an SPSS add-on data component), it is extremely error prone when upgrading either SPSS Statistics or SPSS Data Collection.
From a best practice standpoint, you ought to use the SPSS Data Collection Data Management module. It is very flexible and hardly requires any maintenance on upgrades, because you are working within the same data model framework (e.g. survey metadata, survey versions, labels etc. are handled implicitly) right until you load your transformed data into your reporting database.
Ideally the approach would be to build the mentioned SPSS Data Collection Data Management script and trigger it at the end of each completed interview. In this way your reporting will be close to real-time (you can make it actual real-time by triggering the DM script during the interview using the interview script events - just a FYI).
All scripting on the SPSS Data Collection platform including Data Management scripting is very VB-like, so for most people knowing VB, it is very easy to get started and it is documented very well in the SPSS Data Collection DDL. There you'll also be able to find examples of extracting survey data from SPSS Data Collection surveys (as well as reading and writing data to/from other databases, files etc.). There are also many examples of data manipulation and transformation.
Lastly, to answer your specific questions:
Yes, there is always an MS SQL Server behind SPSS Data Collection, no exceptions. However, generally speaking the data model is way too complex to read data out of it directly. If you have a look at it, you'll quickly realize this.
The MDD file (short for Meta Data Document) contains all survey metadata, including data source specifications, version history etc. Without it you'll not be able to make anything of the survey data in the database, which is the main reason I'd suggest staying within the SPSS Data Collection platform for as large a part of your data handling as possible. However, it is indeed just a readable XML file.
Note that the SPSS Data Collection Data Management module requires a separate license, and if the scripting needed is large or complex, you'd probably want Base Professional too, if that's not what you already use for developing the questionnaires and handling the surveys.
Hope that helps.

This isn't as clean as working directly with whatever database is holding the data, but you could do something with an exported data set:
There may or may not be a way for you to write and run an export script from inside your Admin panel or whatever. If not, you could write a simple Python script using Selenium WebDriver which logs into your admin panel and exports all data to a *.sav data file.
Then you can use the Python SPSS extensions to write your analysis scripts. Note that these scripts have to run on a machine that has a copy of SPSS installed.
Once you have your data and analysis results accessible to Python, you should be able to easily write that to your other database.

Write Microsoft Word Doc with MySQL data

I'm programming a MySQL database with a web interface for remote access, using Django as the framework. Now I want to generate some reports from the MySQL data and modify them after generating them. So I naturally think of exporting the data to Word, or importing it from Word. The thing is, how do I do this?
I have seen several options. One of them is python-docx, a library for generating docx documents in Python. I could have a problem with this, because the generated reports will be large, with lots of images, tables, pages, etc. I have worked with xlsxwriter, and when the files were large it took a long time to generate the xlsx. I don't know if python-docx would be the better solution.
The other option is to import the data directly from Microsoft Word, using some software built for this purpose or a VBA macro. I have written some example VBA code to import MySQL data through ODBC connectors and it works right away, but there are thousands of Word VBA objects and classes to learn.
With the problem laid out, any tips or suggestions? Thanks in advance!

Another option is to generate HTML & open it as a Word document.
If you take a document similar to what you want to generate & save it as HTML, you will see what Word does. Use this file as a template for your documents.
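
A minimal sketch of the HTML approach using only the standard library. The function names and the report layout are hypothetical; the rows would come from your MySQL query (for example a Django values_list() call), and a real report would start from the HTML template that Word itself saved rather than this bare table.

```python
import html
from pathlib import Path


def rows_to_html(title, headers, rows):
    """Render query results as a simple HTML page with one table."""
    head = "".join(f"<th>{html.escape(str(h))}</th>" for h in headers)
    body = "".join(
        "<tr>" + "".join(f"<td>{html.escape(str(c))}</td>" for c in row) + "</tr>"
        for row in rows
    )
    return (
        "<html><head><meta charset='utf-8'>"
        f"<title>{html.escape(title)}</title></head><body>"
        f"<h1>{html.escape(title)}</h1>"
        f"<table border='1'><tr>{head}</tr>{body}</table>"
        "</body></html>"
    )


def write_report(path, title, headers, rows):
    """Write the HTML to disk; with a .doc extension Word can open it for editing."""
    Path(path).write_text(rows_to_html(title, headers, rows), encoding="utf-8")
    return path
```

For example, write_report("report.doc", "Sales", ["Item", "Qty"], rows) produces a file that can be opened and edited in Word. Since this is plain string building, it should stay fast even for large reports.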
