We are trying to create an Azure ML web-service that will receive a .csv data file, do some processing, and return two similar files. The Python support recently added to the Azure ML platform was very helpful, and we were able to successfully port our code, run it in experiment mode, and publish the web-service.
Using the "batch processing" API, we are now able to direct a file from blob-storage to the service and get the desired output. However, run-time for small files (a few KB) is significantly slower than on a local machine, and more importantly, the process seems to never return for slightly larger input data files (40MB). Processing time on my local machine for the same file is under 1 minute.
My question is if you can see anything we are doing wrong, or if there is a way to get this to speed up. Here is the DAG representation of the experiment:
Is this the way the experiment should be set up?
It looks like the problem was with processing of a timestamp column in the input table. The successful workaround was to explicitly force the column to be processed as string values, using the "Metadata Editor" block. The final model now looks like this:
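For reference, the same idea expressed outside Azure ML: a rough pandas sketch that forces a timestamp column to plain strings before any further processing (the column name "timestamp" is just a placeholder).

import pandas as pd

df = pd.read_csv("input.csv")
# Force the problematic column to string, mirroring what the Metadata Editor block does.
df["timestamp"] = df["timestamp"].astype(str)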
Currently, the approach I take is the following (a rough sketch in code is shown after this list):
clearing the rows in the table using Python,
fetching the output of the view in Python and storing the result in a DataFrame,
appending the data to the table using df.to_sql in Python,
scheduling this script to run every day at a specified time (with Prefect).
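For context, a minimal sketch of that pattern, assuming pandas/SQLAlchemy and made-up connection details, table name, and view name:

import pandas as pd
from sqlalchemy import create_engine, text

# Placeholder connection string, table name, and view name.
engine = create_engine("mssql+pyodbc://user:password@my_dsn")

def refresh_table():
    # 1. Clear the rows in the target table.
    with engine.begin() as conn:
        conn.execute(text("DELETE FROM dbo.my_table"))
    # 2. Fetch the output of the view into a DataFrame.
    df = pd.read_sql("SELECT * FROM dbo.my_view", engine)
    # 3. Append the data back to the table.
    df.to_sql("my_table", engine, schema="dbo", if_exists="append", index=False)

if __name__ == "__main__":
    refresh_table()  # this script is then scheduled daily with Prefect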
I find this method unappealing for the following reasons:
This method is external, hence it involves latency.
This method is subject to various dependencies, such as the SQL connector I am using for Python and the scheduler (Prefect), where debugging can get tricky if I have more than 10 tables.
Is there a better way/package/tool to automate the process with the least dependencies and latency?
Have you tried Prefect 2 already? Regarding the load process, you may consider loading data to a temp table and merging from there -- by doing that in SQL, it might be faster and easier to troubleshoot. dbt is also a tool you can consider, and you can orchestrate dbt with prefect using the prefect-dbt package: https://github.com/PrefectHQ/prefect-dbt
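A minimal sketch of the staging-table-plus-merge idea, assuming pandas/SQLAlchemy and made-up table and column names (my_table, my_table_staging, id, value); the MERGE runs inside SQL Server, which is what makes it easier to profile and troubleshoot there:

import pandas as pd
from sqlalchemy import create_engine, text

engine = create_engine("mssql+pyodbc://user:password@my_dsn")  # placeholder

def load_via_staging(df: pd.DataFrame):
    # 1. Load the fresh data into a staging table, replaced on every run.
    df.to_sql("my_table_staging", engine, schema="dbo", if_exists="replace", index=False)
    # 2. Merge from staging into the target table inside SQL Server.
    merge_sql = text("""
        MERGE dbo.my_table AS tgt
        USING dbo.my_table_staging AS src
            ON tgt.id = src.id
        WHEN MATCHED THEN
            UPDATE SET tgt.value = src.value
        WHEN NOT MATCHED THEN
            INSERT (id, value) VALUES (src.id, src.value);
    """)
    with engine.begin() as conn:
        conn.execute(merge_sql)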
I have a question and hope someone can direct me in the right direction. Basically, every week I have to run a query (in SSMS) to get a table containing some information (date, client number, client ID, order ID, etc.), and then I copy all the information from that table and paste it into a folder as a CSV file. It takes me about 15 minutes to do all this, but I keep thinking it could be automated. If yes, how can I do that, and can I also schedule it to run by itself every week? I believe we live in a technological era and this should be done without human input, so I hope I can find someone here willing to show me how to do it using Python.
Many thanks for considering my request.
This should be pretty simple to automate (a rough sketch follows the steps below):
Use some database adapter which can work with your database, for MSSQL the one delivered by pyodbc will be fine,
Within the script, connect to the database, perform the query, and parse the output,
Save the parsed output to a .csv file (you can use the csv Python module),
Run the script as a periodic task using cron/schtasks if you work on Linux/Windows respectively.
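A minimal sketch of those steps, assuming pyodbc and made-up connection details, query, and output path:

import csv
import pyodbc

# Placeholder connection details and query.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=my_server;"
    "DATABASE=my_db;Trusted_Connection=yes;"
)
cursor = conn.cursor()
cursor.execute("SELECT date, clientnumber, clientID, orderid FROM dbo.orders")

with open(r"C:\exports\weekly_report.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow([column[0] for column in cursor.description])  # header row
    writer.writerows(cursor.fetchall())

conn.close()
# Schedule this script weekly with cron (Linux) or schtasks / Task Scheduler (Windows).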
Please note that your question is too broad, and shows no research effort.
You will find that Python can do the tasks you desire.
There are many different ways to interact with SQL servers, depending on your implementation. I suggest you learn Python+SQL using the built-in sqlite3 library. You will want to save your query as a string and pass it into an SQL connection manager of your choice; this depends on your server setup, as there are many different SQL packages for Python.
You can use pandas for parsing the data and saving it to a .csv file (the method is literally called to_csv).
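For example, a minimal sketch using the built-in sqlite3 module together with pandas (the database file and query are placeholders):

import sqlite3
import pandas as pd

# Placeholder database file and query.
conn = sqlite3.connect("example.db")
df = pd.read_sql_query("SELECT date, clientnumber, clientID, orderid FROM orders", conn)
conn.close()

df.to_csv("weekly_report.csv", index=False)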
Python does have many libraries for scheduling tasks, but I suggest you hold off for a while. Develop your code in a way that it can be run manually, which will still be much faster/easier than without Python. Once you know your code works, you can easily implement a scheduler. The downside is that your program will always need to be running, and you will need to keep checking to see if it is running. Personally, I would keep it restricted to manually running the script; you could compile it to an .exe and bind it to a hotkey if you need the accessibility.
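If you do eventually add scheduling from within Python, the third-party schedule package is one common option; a rough sketch (export_report is a hypothetical function wrapping the query-and-save logic above), which also shows why the program has to keep running:

import time
import schedule  # third-party package: pip install schedule

def export_report():
    pass  # run the query and write the CSV, as sketched above

schedule.every().monday.at("07:00").do(export_report)

while True:  # the script must stay running for the schedule to fire
    schedule.run_pending()
    time.sleep(60)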
I have a question on the general strategy of how to integrate data into an MSSQL database.
Currently, I use python for my whole ETL process. I use it to clean, transform, and integrate the data in an MSSQL database. My data is small so I think this process works fine for now.
However, I think it's a little awkward for my code to constantly read data from and write data to the database. I think this strategy will be an issue once I'm dealing with a large amount of data, and the constant read/write seems very inefficient. However, I don't know enough to know whether this is a real problem or not.
I want to know if this is a feasible approach or should I switch entirely to SSIS to handle it. SSIS to me is clunky and I'd prefer not to re-write my entire code. Any input on the general ETL architecture would be very helpful.
Is this practice alright? Maybe?
There are too many factors to give a definitive answer. Conceptually, what you're doing - extract data from the source, transform it, load it to the destination (ETL) - is all that SSIS does. It can likely do things more efficiently than Python - at least I've had a devil of a time getting a bulk load to work with memory-mapped data. Dump to disk and bulk insert that via Python - no problem. But if the existing process works, then let it go until it doesn't work.
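For what it's worth, the "dump to disk and bulk insert via Python" pattern can be a short sketch like the following; the paths, DSN, and table name are made up, the CSV must be on a path visible to the SQL Server instance itself, and FORMAT = 'CSV' needs SQL Server 2017+:

import pandas as pd
import pyodbc

# Stand-in for whatever the transform step produced.
df = pd.DataFrame({"id": [1, 2], "value": ["a", "b"]})
df.to_csv(r"C:\staging\my_table.csv", index=False)

conn = pyodbc.connect("DSN=my_dsn;Trusted_Connection=yes;")  # placeholder DSN
conn.execute(r"""
    BULK INSERT dbo.my_table
    FROM 'C:\staging\my_table.csv'
    WITH (FORMAT = 'CSV', FIRSTROW = 2);
""")
conn.commit()
conn.close()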
If your team knows Python, introducing SSIS just to do ETL is likely going to be a bigger maintenance cost than scaling up your existing approach. On the other hand, if it's standard-ish Python + libraries and you're on SQL Server 2017+, you might be able to execute your scripts from within the database itself via sp_execute_external_script.
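Roughly what a call to sp_execute_external_script looks like, assuming SQL Server 2017+ with Machine Learning Services (Python) enabled; the T-SQL is wrapped in a Python string and sent through pyodbc only to keep the example in one language, and the table and column names are made up:

import pyodbc

conn = pyodbc.connect("DSN=my_dsn;Trusted_Connection=yes;")  # placeholder DSN

# Runs a small Python transform *inside* SQL Server and returns the result set.
tsql = """
EXEC sp_execute_external_script
    @language = N'Python',
    @script = N'OutputDataSet = InputDataSet.assign(value=InputDataSet["value"].str.strip())',
    @input_data_1 = N'SELECT id, value FROM dbo.my_table';
"""
rows = conn.execute(tsql).fetchall()
conn.close()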
If the ETL process runs on the same box as the database, then ensure you have sufficient resources to support both processes at their maximum observed levels of activity. If the ETL runs elsewhere, then you'll want to ensure you have fast, full duplex connectivity between the database server and the processing box.
Stand up a load testing environment that parallels production's resources. Dummy up a 10x increase in source data and observe how the ETL fares. 100x, 1000x. At some point, you'll identify what development sins you committed that do not scale and then you're poised to ask a really good, detailed question describing the current architecture, the specific code that does not perform well under load and how one can reproduce this load.
The above design considerations will hold true for Python, SSIS or any other ETL solution - prepackaged or bespoke.
I have an SSIS package that will import an Excel file. I want to use a Python script to run through all the column headings and replace any white spaces with a '_'.
Previously when doing this for a pandas dataframe, I'd use:
df.columns = [w.replace(' ','_') for w in list(df.columns)]
However, I don't know how to reference the column headers from Python. I understand that I would use an 'Execute Process Task' and how to implement that in SSIS; however, how can I refer to a dataset contained within the SSIS package from Python?
Your dataset won't be in SSIS. The only data that is "in" SSIS are row buffers in a Data Flow Task. There you define a source, destination and any transformation that takes place per row.
If you're going to execute a python script, the end result is that you've expressed the original Excel file in some other format. Maybe you rewrote it as a CSV, maybe you wrote it to a table, perhaps it's just written back as a new Excel file but with no whitespace in the column names.
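For example, a small preprocessing script along those lines, launched from an Execute Process Task before the Data Flow Task (the file paths are whatever arguments you pass in from SSIS):

import sys
import pandas as pd

# Paths are passed in by the SSIS Execute Process Task as arguments.
source_xlsx, target_csv = sys.argv[1], sys.argv[2]

df = pd.read_excel(source_xlsx)  # requires openpyxl for .xlsx files
df.columns = [c.replace(" ", "_") for c in df.columns]
df.to_csv(target_csv, index=False)  # the Data Flow Task then reads this CSV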
There is no native Data Flow source that will allow you to use Python directly. There is a Script Component, which allows you to run just about anything, and there is IronPython, which you could run in SSIS, but that's not going to work for a Data Flow Task. A Data Flow Task is metadata-dependent at run time. That is, before the package runs, the engine will interrogate the source and destination elements to ensure they exist and that the data type of the columns is the same as or bigger than the data type described in the contract that was built during design time.
In simple terms, you can't dynamically change out the shape of the data in a Data Flow Task. If you need a generic dynamic data importer, then you're writing all the logic yourself. You can still use SSIS as the execution framework as it has nice logging, management, etc but your SSIS package is going to be a mostly .NET project.
So, with all of that said, I think the next challenge you'll run into if you try to use IronPython with Pandas is that they don't work together. At least, not well enough that the expressed desire "a column rename" is worth the effort and maintenance headache you'd have.
There is an option to execute sp_execute_external_script with a Python script in a Data Flow and use it as a source. You can also save the output to a CSV or Excel file and read it in SSIS.
I wish to export log files from multiple nodes (in my case Apache access and error logs) and aggregate that data in batch, as a scheduled job. I have seen multiple solutions that work with streaming data (i.e. think Scribe). I would like a tool that gives me the flexibility to define the destination. This requirement comes from the fact that I want to use HDFS as the destination.
I have not been able to find a tool that supports this in batch. Before reinventing the wheel I wanted to ask the StackOverflow community for their input.
If a solution exists already in python that would be even better.
We use http://mergelog.sourceforge.net/ to merge all our Apache logs.
Take a look at Zohmg; it's an aggregation/reporting system for log files using HBase and HDFS: http://github.com/zohmg/zohmg
Scribe can meet your requirements; there's a version (link) of Scribe that can aggregate logs from multiple sources, and after reaching a given threshold it stores everything in HDFS. I've used it and it works very well. Compilation is quite complicated, so if you have any problems, ask a question.
PiCloud may help.
The PiCloud Platform gives you the freedom to develop your algorithms and software without sinking time into all of the plumbing that comes with provisioning, managing, and maintaining servers.