I have an SSIS package that will import an Excel file. I want to use a Python script to run through all the column headings and replace any white space with a '_'.
Previously when doing this for a pandas dataframe, I'd use:
df.columns = [w.replace(' ','_') for w in list(df.columns)]
However, I don't know how to reference the column headers from Python. I understand I'd use an 'Execute Process Task' and how to implement that in SSIS, but how can I refer to a dataset contained within the SSIS package from Python?
Your dataset won't be in SSIS. The only data that is "in" SSIS are row buffers in a Data Flow Task. There you define a source, destination and any transformation that takes place per row.
If you're going to execute a python script, the end result is that you've expressed the original Excel file in some other format. Maybe you rewrote it as a CSV, maybe you wrote it to a table, perhaps it's just written back as a new Excel file but with no whitespace in the column names.
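For example, here is a minimal sketch of a standalone script an Execute Process Task could call (assuming pandas and openpyxl are installed; the script name and the argument-passed file paths are placeholders):
# rename_headers.py -- hypothetical helper called from an Execute Process Task
import sys
import pandas as pd

src, dst = sys.argv[1], sys.argv[2]  # input and output paths supplied by the SSIS task
df = pd.read_excel(src)  # read the original workbook
df.columns = [c.replace(' ', '_') for c in df.columns]  # replace spaces in the headers
df.to_excel(dst, index=False)  # write a cleaned copy for the Data Flow to consume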
There is no native Data Flow source that will allow you to use Python directly. There is a Script Component, which allows you to run arbitrary .NET code, and there is IronPython, which you could host in SSIS, but neither is going to work for a Data Flow Task. A Data Flow Task is metadata dependent at run time. That is, before the package runs, the engine interrogates the source and destination elements to ensure they exist and that the data types of the columns are the same as, or wider than, the data types described in the contract that was built at design time.
In simple terms, you can't dynamically change the shape of the data in a Data Flow Task. If you need a generic, dynamic data importer, then you're writing all the logic yourself. You can still use SSIS as the execution framework, since it has nice logging, management, etc., but your SSIS package is going to be mostly a .NET project.
So, with all of that said, I think the next challenge you'll run into if you try to use IronPython with pandas is that they don't work together. At least, not well enough that the expressed desire, a column rename, is worth the effort and maintenance headache you'd have.
There is an option to execute sp_execute_external_script with a Python script in a Data Flow and use it as a source. You can also save the output to a CSV or Excel file and read it in SSIS.
In the general case, is there a way in VSC to pre-load the debug session with objects coming from a file each time we launch the debugger? For example, dataframes? Because each time it launches it has no objects at all and needs the running code to start creating them.
So the challenge is that we already have some files on the hard drive with heavy objects, but when we launch the debugger I can't tell it to just pre-load the objects stored in those files, say dataframes or lists or connection objects or whatever.
To be more specific, suppose there's Python code that has two sections:
Section 1: Code we know works perfectly and makes heavy calculations to create some objects.
Section 2: Code that takes those objects and performs operations. We want to debug this section. We also know that no code line in this section interacts in any way with the code or stacks of Section 1; it simply takes the objects created by Section 1.
Example: Section 1 queries an enormous table and loads it into a dataframe, sorts it, filters it, etc. Then Section 2 just needs that dataframe and computes some statistics.
Is there a way to just launch from Section 2 and load the dataframe we have stored in a CSV? Or do we always need to launch from Section 1, run the connection, get the dataframe from a query (which takes a lot of time), and then finally arrive at Section 2 to start debugging?
Note: I could just make a .py file containing the Section 2 code and hard-code at the beginning of it a line that reads the CSV I have. But is there a fancier way to do this without having to make another .py file for debugging, manually writing code into it, and then debugging that .py file?
The question is: can we launch the VSC Python debugger and tell it to load Python objects from files in folders, rather than launching the session with no objects and waiting for the code to create them?
There is no way to convert CSV files to Python objects before debugging, since all Python objects live in memory.
If you don't want to set them in your code, I would suggest using an environment variable for it, which you can set by adding an "env" entry in your launch.json.
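A rough sketch of that pattern (the variable name PRELOAD_CSV and the helper function are made up; the environment variable would be set under "env" in launch.json):
import os
import pandas as pd

def build_dataframe_from_query():
    # placeholder for the real Section 1 query/sort/filter logic
    return pd.DataFrame({"x": [1, 2, 3]})

preload_path = os.environ.get("PRELOAD_CSV")  # set via "env" in launch.json when debugging
if preload_path:
    df = pd.read_csv(preload_path)  # debugging: skip Section 1 and load the saved object
else:
    df = build_dataframe_from_query()  # normal run: Section 1 does the heavy work

# Section 2 starts here and uses df either way
print(df.describe())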
I have a question and hope someone can point me in the right direction. Basically, every week I have to run a query (in SSMS) to get a table containing some information (date, clientnumber, clientID, orderid, etc.), and then I copy all the information in that table and paste it into a folder as a CSV file. It takes me about 15 minutes to do all this, but I am thinking: can I automate this? If so, how, and can I also schedule it so it runs by itself every week? I believe we live in a technological era and this should be done without human input, so I hope I can find someone here willing to show me how to do it using Python.
Many thanks for considering my request.
This should be pretty simple to automate:
Use a database adapter that works with your database; for MSSQL, the one provided by pyodbc will be fine,
Within the script, connect to the database, perform the query, and parse the output (see the sketch after this list),
Save the parsed output to a .csv file (you can use the csv Python module),
Run the script as a periodic task using cron/schtasks if you work on Linux/Windows respectively.
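A minimal sketch, assuming pyodbc is installed; the driver, server, database, table, and output path below are placeholders:
import csv
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myserver;DATABASE=mydb;Trusted_Connection=yes;"
)
cursor = conn.cursor()
cursor.execute("SELECT OrderDate, ClientNumber, ClientID, OrderID FROM dbo.Orders")

with open(r"C:\exports\weekly_orders.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow([col[0] for col in cursor.description])  # header row from the cursor metadata
    writer.writerows(cursor.fetchall())  # data rows

conn.close()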
Please note that your question is too broad, and shows no research effort.
You will find that Python can do the tasks you desire.
There are many different ways to interact with SQL servers, depending on your implementation. I suggest you learn Python + SQL using the built-in sqlite3 library. You will want to save your query as a string and pass it into an SQL connection manager of your choice; this depends on your server setup, and there are many different SQL packages for Python.
You can use pandas for parsing the data and saving it to a .csv file (the method is literally called to_csv).
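For instance, a small sketch using sqlite3 just to keep it runnable end to end (the table and column names are made up; with a real server you would swap in the appropriate connection object):
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_date TEXT, client_id INTEGER, order_id INTEGER)")
conn.execute("INSERT INTO orders VALUES ('2023-01-02', 101, 5001)")

query = "SELECT order_date, client_id, order_id FROM orders"  # keep your real query as a string
df = pd.read_sql_query(query, conn)  # pandas parses the result set into a dataframe
df.to_csv("weekly_orders.csv", index=False)  # write the CSV into your folder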
Python does have many libraries for scheduling tasks, but I suggest you hold off for a while. Develop your code in a way that it can be run manually, which will still be much faster/easier than doing it without Python. Once you know your code works, you can easily implement a scheduler. The downside is that your program will always need to be running, and you will need to keep checking to see whether it is running. Personally, I would keep it restricted to manually running the script; you could compile it to a .exe and bind it to a hotkey if you need the accessibility.
I have a situation where multiple sources will need to read from the same (small) data source, possibly at the same time. For example, multiple different computers calling a function that needs to read from an external data source (e.g. an Excel file). Since multiple different sources are involved, I cannot simply read from the data source once and pass it into the function; it must be loaded in the function.
Is there a data source that can handle this effectively? A pandas dataframe is an acceptable format for the information that needs to be read, so I tried storing that dataframe in an sqlite3 database since, according to the sqlite3 website, sqlite3 databases can handle concurrent reads. Unfortunately, it fails too often. I tried multiple different iterations and simply could not get it to work.
Is there another data format/source that would work/be effective? I tried scouring the internet for whether something as simple as an Excel file plus the pandas read_excel function could handle this type of concurrency, but I could not find information. As an experiment, I used a multiprocessing pool to simultaneously load the same very large (roughly 1-minute load) Excel file and it did not crash. But of course, that is not exactly a perfect experiment.
Thanks!
You can try using openpyxl's read-only mode. It uses a generator instead of loading the whole file.
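A minimal sketch of the read-only mode (the file name is a placeholder):
from openpyxl import load_workbook

wb = load_workbook("shared_source.xlsx", read_only=True)  # rows are streamed rather than loaded all at once
ws = wb.active
for row in ws.iter_rows(values_only=True):  # each row is yielded lazily as a tuple
    print(row)
wb.close()  # read-only workbooks keep the file handle open, so close explicitly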
Also take a look at processing large xlsx file in python
I have a C++ based application logging data to files and I want to load that data in Python so I can explore it. The data files are flat files with a known number of records per file. The data records are represented as a struct (with nested structs) in my C++ application. This struct (and its sub-structs) changes regularly during my development process, so I also have to make associated changes to the Python code that loads the data. This is obviously tedious and doesn't scale well. What I am interested in is a way to automate the process of updating the Python code (or some other way to handle this problem altogether). I am exploring some libraries that convert my C++ structs to other formats such as JSON, but I have yet to find a solid solution. Can anyone suggest something?
Consider using data serialization system / format that has C++ and Python bindings: https://en.wikipedia.org/wiki/Comparison_of_data-serialization_formats
(e.g. protobuf or even json or csv)
Alternatively, consider writing a library in C that reads the data and exposes it as structures. Then use https://docs.python.org/3.7/library/ctypes.html to call this C library and retrieve records.
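If the records are simple enough, ctypes can also mirror the struct and read the flat file directly, without the intermediate C library. A minimal sketch of that idea (the field names and file name are made up, and the layout/packing must match the C++ struct):
import ctypes

class Record(ctypes.Structure):
    _fields_ = [  # must mirror the C++ struct, including packing/alignment
        ("timestamp", ctypes.c_double),
        ("sensor_id", ctypes.c_int32),
        ("value", ctypes.c_float),
    ]

with open("log.bin", "rb") as f:
    while True:
        chunk = f.read(ctypes.sizeof(Record))
        if len(chunk) < ctypes.sizeof(Record):
            break
        rec = Record.from_buffer_copy(chunk)  # reinterpret the raw bytes as a Record
        print(rec.timestamp, rec.sensor_id, rec.value)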
Of course, if the semantics of the data change (e.g. a new important field needs to be analyzed), you will have to handle that new stuff in the Python code. No free lunch.
I have spent a few days orienting myself to the spss and spssaux modules, which are great resources. Though I feel like I am missing some conceptual understanding, because so far I can only do basic things like retrieving value labels via spssaux.getValueLabels or spss.DataStep():
print spssaux.getValueLabels(2)
>>> {u'1': u'Neutral', u'0': u'Disagree', u'2': u'Agree'}
or
dataset = spssDataset()
variable_list = dataset.varList
print variable_list[2].valueLabels.data
>>> {0.0: u'Disagree', 1.0: u'Neutral', 2.0: u'Agree'}
However, I'm struggling to figure out how to retrieve the actual data values.
I'm also having trouble figuring out how to retrieve values from analyses and use them in Python. At the moment I have been running analyses using spss.Submit(), but I suspect this is limited in terms of feeding values back to Python (i.e., feeding means and significance values back to Python, which can then be used in Python to make decisions).
If you have any suggestions or ideas, please note that I need to be operating within the Python environment, as this data retrieval/analysis is incorporated into a broader Python program.
Thanks!
The spss.Cursor class is a low level class that is rather hard to use. The spssdata.Spssdata class provides a much friendlier interface. You can also use the spss.Dataset class, which was modeled after Spssdata and has additional capabilities but is slower.
For retrieving Viewer output, the basic workhorse is OMS, writing to the XML workspace or to new datasets. You can use some functions in the spssaux module that wrap this: createDatasetOutput simplifies creating datasets from tables; createXmlOutput and its companion getValuesFromXmlWorkspace use the XML workspace. Underneath the latter, the spss.EvaluateXPath API lets you pluck whatever piece of output you want from a table.
Also, if you are basically living in a Python world, have you discovered external mode? This lets you run Statistics from an external Python program. You can use your Python IDE to work interactively in the Python code and debug. You just import the spss module and whatever else you need and use the provided apis as needed. In external mode, however, there is no Viewer, so you can't use the SpssClient module apis.
See the spss.Cursor class in the Python reference guide for SPSS. It is hard to give general advice about your workflow, but if you are producing stats in SPSS files you can then grab them for use in Python programs. Here is one example:
*Make some fake data.
DATA LIST FREE / ID X.
BEGIN DATA
1 5
2 6
3 7
END DATA.
DATASET NAME Orig.
BEGIN PROGRAM Python.
import spss, spssdata
alldata = spssdata.Spssdata().fetchall()
print alldata
#this just grabs all of the data
END PROGRAM.
*Make your mean in SPSS syntax.
AGGREGATE OUTFILE=* MODE=ADDVARIABLES
/BREAK
/MeanX = MEAN(X).
BEGIN PROGRAM Python.
var = ["MeanX"]
alldata2 = spssdata.Spssdata(var).fetchone()
print alldata2
#This just grabs the mean of the variable you created
END PROGRAM.