I have a C++ application logging data to files, and I want to load that data in Python so I can explore it. The data files are flat files with a known number of records per file. The data records are represented as a struct (with nested structs) in my C++ application. This struct (and its substructs) changes regularly during my development process, so I also have to make associated changes to the Python code that loads the data. This is obviously tedious and doesn't scale well. What I am interested in is a way to automate the process of updating the Python code (or some other way to handle this problem altogether). I am exploring some libraries that convert my C++ structs to other formats such as JSON, but I have yet to find a solid solution. Can anyone suggest something?
Consider using a data serialization system/format that has C++ and Python bindings: https://en.wikipedia.org/wiki/Comparison_of_data-serialization_formats
(e.g. protobuf, or even JSON or CSV).
Alternatively, consider writing a library in C that reads the data and exposes it as structures. Then use https://docs.python.org/3.7/library/ctypes.html to call this C library and retrieve records.
Of course, if the semantics of the data change (e.g. a new important field needs to be analyzed), you will have to handle that new stuff in the Python code. No free lunch.
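Even without a separate C library, ctypes can mirror the struct layout and read the flat file directly. A minimal sketch, assuming a hypothetical nested record layout (mirror your real structs and mind packing/alignment):

import ctypes

# Hypothetical mirror of the C++ structs -- adjust field names, types and
# packing (_pack_) to match the real definitions in your application.
class Position(ctypes.Structure):
    _fields_ = [("x", ctypes.c_double),
                ("y", ctypes.c_double),
                ("z", ctypes.c_double)]

class Record(ctypes.Structure):
    _fields_ = [("timestamp", ctypes.c_uint64),
                ("pos", Position),            # nested struct
                ("status", ctypes.c_int32)]

def read_records(path, count):
    """Read `count` fixed-size records straight from the flat file."""
    records = (Record * count)()
    with open(path, "rb") as f:
        f.readinto(records)
    return records

# Example usage (file name and record count are assumptions):
# for rec in read_records("log_0001.dat", 1000):
#     print(rec.timestamp, rec.pos.x, rec.status)

The downside is that this mirror still has to be kept in sync with the C++ headers by hand, so it mainly helps if you generate it (or a serialization schema) from a single source of truth.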
I am having a hard time with this. Is there a way to get a compiled protocol buffer file's (pb2.py) contents into Excel?
Your question lacks detail and a demonstration of your attempt to solve this problem, so it is likely that it will be closed.
Presumably (!?) your intent is to start with serialized binary/wire-format protocol buffer messages, unmarshal these into Python objects and then, using a Python package that can interact with Excel, enter these objects as rows in Excel.
The Python (pb2.py) file generated by the protocol buffer compiler (protoc) from a .proto file contains everything you need to marshal and unmarshal messages in the binary/wire format to Python objects representing the messages etc. defined by the .proto file. The protocol buffer documentation is comprehensive and explains this well (link).
Once you've unmarshaled the data into one or more Python objects, you will need to use the Python package for Excel of your choosing to output these objects into the spreadsheet(s).
It is unclear whether you have flat or hierarchical data. If you have anything non-trivial, you'll also need to decide how to represent the structure in the spreadsheet's table-oriented structure.
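As a rough sketch of that pipeline, assuming a hypothetical generated module measurements_pb2 with a Measurement message that has id, name and value fields, and using openpyxl on the Excel side:

import openpyxl
import measurements_pb2  # hypothetical module generated by protoc from your .proto

# Unmarshal one serialized message from a binary file (file name is an assumption).
with open("measurement.bin", "rb") as f:
    msg = measurements_pb2.Measurement()
    msg.ParseFromString(f.read())

# Write the (flat) fields as one row in a spreadsheet.
wb = openpyxl.Workbook()
ws = wb.active
ws.append(["id", "name", "value"])        # header row
ws.append([msg.id, msg.name, msg.value])  # data row
wb.save("measurements.xlsx")

Nested or repeated fields would need to be flattened (or split across sheets) before they can be written as rows.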
I have a situation where multiple sources will need to read from the same (small) data source, possibly at the same time. For example, multiple different computers calling a function that needs to read from an external data source (e.g. an Excel file). Since multiple different sources are involved, I cannot simply read from the data source once and pass it into the function; it must be loaded inside the function.
Is there a data source that can handle this effectively? A pandas dataframe was an acceptable format for the information that needs to be read, so I tried storing that dataframe in an SQLite database since, according to the SQLite website, SQLite databases can handle concurrent reads. Unfortunately, it is failing too often. I tried multiple different iterations and simply could not get it to work.
Is there another data format/source that would work/be effective? I tried scouring the internet for whether something as simple as an Excel file plus the pandas read_excel function could handle this type of concurrency, but I could not find any information. I tried an experiment of using a multiprocessing pool to simultaneously load the same very large (i.e. one-minute load) Excel file and it did not crash. But of course, that is not exactly a perfect experiment.
Thanks!
You can try using openpyxl's read-only mode. It uses a generator instead of loading the whole file.
Also take a look at Processing large xlsx file in Python.
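A minimal sketch of read-only mode (the file name and row handling are placeholders):

from openpyxl import load_workbook

# read_only=True streams rows through a generator instead of building the
# whole workbook in memory, which keeps each reader's footprint small.
wb = load_workbook("shared_data.xlsx", read_only=True)
ws = wb.active
for row in ws.iter_rows(values_only=True):
    print(row)   # replace with your own processing
wb.close()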
I have an SSIS package that will import an Excel file. I want to use a Python script to run through all the column headings and replace any white space with a '_'.
Previously when doing this for a pandas dataframe, I'd use:
df.columns = [w.replace(' ','_') for w in list(df.columns)]
However, I don't know how to reference the column headers from Python. I understand I'd use an 'Execute Process Task' and how to implement that in SSIS, but how can I refer to a dataset contained within the SSIS package from Python?
Your dataset won't be in SSIS. The only data that is "in" SSIS is the row buffers in a Data Flow Task. There you define a source, a destination, and any transformations that take place per row.
If you're going to execute a python script, the end result is that you've expressed the original Excel file in some other format. Maybe you rewrote it as a CSV, maybe you wrote it to a table, perhaps it's just written back as a new Excel file but with no whitespace in the column names.
There is no native Data Flow source that will allow you to use Python directly. There is a Script Component, which allows you to run anything, and there is IronPython, which would allow you to run IronPython in SSIS, but that's not going to work for a Data Flow Task. A Data Flow Task is metadata-dependent at run time. That is, before the package runs, the engine will interrogate the source and destination elements to ensure they exist and that the data type of the columns is the same as or bigger than the data type described in the contract that was built at design time.
In simple terms, you can't dynamically change out the shape of the data in a Data Flow Task. If you need a generic dynamic data importer, then you're writing all the logic yourself. You can still use SSIS as the execution framework as it has nice logging, management, etc but your SSIS package is going to be a mostly .NET project.
So, with all of that said, I think the next challenge you'll run into if you try to use IronPython with pandas is that they don't work together. At least, not well enough that the stated goal (a column rename) is worth the effort and maintenance headache you'd have.
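If you do take the pre-processing route described above (rewrite the file before the Data Flow reads it), a short CPython/pandas script launched from an Execute Process Task could handle the rename; this is only a sketch with placeholder file arguments:

import sys
import pandas as pd

# Usage: python clean_headers.py <input.xlsx> <output.csv>
src, dst = sys.argv[1], sys.argv[2]

df = pd.read_excel(src)
# Replace whitespace in the column headers with underscores.
df.columns = [c.replace(" ", "_") for c in df.columns]
df.to_csv(dst, index=False)

The Data Flow then reads the cleaned CSV with a fixed, known set of column names.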
There is an option to execute sp_execute_external_script with a Python script in a Data Flow and use it as a source. You can also save the output to a CSV or Excel file and read it in SSIS.
I have two programs. My main program is in Python and the other is in C++, because I have to connect to a machine and the only way to use this machine is through a DLL/lib written in C++ that the manufacturer gave me.
In my Python program, I'm getting thousands of points from a database, doing some operations on them, and storing those points in different arrays in a class.
Now I need to send those points to the C++ program and I was wondering how. I could send them as parameters to the main of the C++ program, but the command line would then be too long, and I think even if it weren't too long, this would be really bad practice (if someone can explain a little why, that would be cool!).
Actually, I'm thinking of generating a file with all the coordinates in my Python program and sending it to the C++ program, but then I have to parse it in C++.
Is this a good way to do it? Or would another solution be better?
Some ways:
Use subprocess.Popen() with stdin=subprocess.PIPE and write the data one point per line, parsing it on the C++ side
Use temporary files, either text files or other files
Use ctypes and load the C++ part as shared library and then call into the C++ functions from Python directly
Use Python's struct module to encode your data from Python into C structs, which avoids string formatting and string parsing -- this can be combined with any of the three above: subprocess.Popen (read binary data from stdin in C++), temporary files (open in binary mode), or ctypes (pass a pointer to the struct directly to the library function); a sketch of this combination is shown below
As to the question why passing arguments on the command line is problematic, there are limits imposed by the OS, e.g. getconf ARG_MAX.
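A minimal sketch combining the struct option with subprocess.Popen (the consumer binary name and the point layout are assumptions; the C++ side would read the same layout from stdin):

import struct
import subprocess

points = [(1.0, 2.0, 3.0), (4.0, 5.0, 6.0)]   # your thousands of points

# Pack a record count followed by three little-endian doubles per point.
chunks = [struct.pack("<I", len(points))]
for x, y, z in points:
    chunks.append(struct.pack("<ddd", x, y, z))
payload = b"".join(chunks)

# Hypothetical C++ consumer that reads the binary stream from its stdin.
proc = subprocess.Popen(["./points_consumer"], stdin=subprocess.PIPE)
proc.communicate(payload)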
I have spent a few days orienting myself to the spss and spssaux modules, which are great resources, though I feel like I am missing some conceptual understanding. I can do basic things like retrieving value labels via spssaux.getValueLabels or spss.DataStep():
print spssaux.getValueLabels(2)
>>> {u'1': u'Neutral', u'0': u'Disagree', u'2': u'Agree'}
or
dataset = spss.Dataset()
variable_list = dataset.varList
print variable_list[2].valueLabels.data
>>> {0.0: u'Disagree', 1.0: u'Neutral', 2.0: u'Agree'}
However, I'm struggling to figure out how to retrieve the actual data values.
I'm also having trouble figuring out how to retrieve values from analyses and use them in Python. At the moment I have been running analyses using spss.Submit(), but I suspect this is limited in terms of feeding values back to Python (i.e., feeding means and significance values back to Python, which can then be used in Python to make decisions).
If you have any suggestions or ideas, please note that I need to be operating within the Python environment, as this data retrieval/analysis is incorporated into a broader Python program.
Thanks!
The spss.Cursor class is a low level class that is rather hard to use. The spssdata.Spssdata class provides a much friendlier interface. You can also use the spss.Dataset class, which was modeled after Spssdata and has additional capabilities but is slower.
For retrieving Viewer output, the basic workhorse is OMS, writing to the XML workspace or to new datasets. You can use some functions in the spssaux module that wrap this: createDatasetOutput simplifies creating datasets from tables, while createXmlOutput and the companion getValuesFromXmlWorkspace use the XML workspace. Underneath the latter, the spss.EvaluateXPath API lets you pluck whatever piece of output you want from a table.
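A rough sketch of the XML workspace route, routing a DESCRIPTIVES table through OMS and pulling the mean back into Python (the XPath is illustrative and usually needs adjusting to the actual table structure):

import spss

handle = "descout"   # name for the XML workspace object
spss.Submit("""
OMS /SELECT TABLES
    /IF COMMANDS=['Descriptives'] SUBTYPES=['Descriptive Statistics']
    /DESTINATION FORMAT=OXML XMLWORKSPACE='%s'.
DESCRIPTIVES VARIABLES=X.
OMSEND.
""" % handle)

# Pluck the mean out of the pivot table with an XPath query (adjust as needed).
result = spss.EvaluateXPath(handle, "/outputTree",
    "//pivotTable//category[@text='Mean']/cell/@text")
print result
spss.DeleteXPathHandle(handle)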
Also, if you are basically living in a Python world, have you discovered external mode? This lets you run Statistics from an external Python program. You can use your Python IDE to work interactively in the Python code and debug. You just import the spss module and whatever else you need and use the provided APIs as needed. In external mode, however, there is no Viewer, so you can't use the SpssClient module APIs.
See the spss.Cursor class in the Python reference guide for SPSS. It is hard to give general advice about your workflow, but if you are producing stats in SPSS files you can then grab them for use in Python programs. Here is one example:
*Make some fake data.
DATA LIST FREE / ID X.
BEGIN DATA
1 5
2 6
3 7
END DATA.
DATASET NAME Orig.
BEGIN PROGRAM Python.
import spss, spssdata
alldata = spssdata.Spssdata().fetchall()
print alldata
#this just grabs all of the data
END PROGRAM.
*Make your mean in SPSS syntax.
AGGREGATE OUTFILE=* MODE=ADDVARIABLES
/BREAK
/MeanX = MEAN(X).
BEGIN PROGRAM Python.
var = ["MeanX"]
alldata2 = spssdata.Spssdata(var).fetchone()
print alldata2
#This just grabs the mean of the variable you created
END PROGRAM.