Is there any way to import an SPSS dataset into Python, preferably in NumPy recarray format?
I have looked around but could not find any answer.
Joon
SPSS has extensive integration with Python, but that is meant to be used from within SPSS (now known as IBM SPSS Statistics). There is an SPSS ODBC driver that could be used with Python's ODBC support to read a sav file.
Option 1
As rkbarney pointed out, there is the Python package savReaderWriter, available via PyPI. I've run into two issues:
It relies on a lot of extra libraries beyond the seemingly pure-Python implementation. SPSS files are read and written in nearly every case by the IBM-provided SPSS I/O modules. These modules differ by platform, and in my experience "pip install savReaderWriter" doesn't get them running out of the box (on OS X).
Development on savReaderWriter is, while not dead, less up-to-date than one might hope. This complicates the first issue. It relies on some deprecated packages to increase speed and gives some warnings any time you import savReaderWriter if they're not available. Not a huge issue today, but it could be trouble in the future as IBM continues to update the SPSS I/O modules to deal with new SPSS formats (they're on version 21 or 22 already, if memory serves).
Option 2
I've chosen to use R as a middle-man. Using rpy2, I set up a simple function to read the file into an R data frame and output it again as a CSV file, which I subsequently import into Python. It's a bit Rube Goldberg, but it works. Of course, this requires R, which may also be a hassle to install in your environment (and has different binaries for different platforms).
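A minimal sketch of this round trip, under some assumptions: it calls Rscript through subprocess instead of rpy2 (a swapped-in technique, not the setup described above), it assumes R and its foreign package are installed, and the function names are hypothetical. It targets Python 3 and does not escape quotes inside file paths.

```python
import csv
import subprocess


def build_rscript_command(sav_path, csv_path):
    """Return an Rscript command that converts sav_path to csv_path."""
    # Forward slashes avoid backslash escaping inside the R string literals.
    sav = sav_path.replace("\\", "/")
    out = csv_path.replace("\\", "/")
    r_code = (
        'library(foreign); '
        'df <- read.spss("{0}", to.data.frame = TRUE); '
        'write.csv(df, "{1}", row.names = FALSE)'
    ).format(sav, out)
    return ["Rscript", "-e", r_code]


def sav_to_rows(sav_path, csv_path):
    """Convert the .sav file with R, then read the CSV as a list of dicts."""
    subprocess.check_call(build_rscript_command(sav_path, csv_path))
    with open(csv_path, newline="") as handle:
        return list(csv.DictReader(handle))
```

All values come back as strings, so converting the rows into a NumPy record array would still need an explicit dtype per column.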
gretl claims to import SPSS and export in a variety of formats, as does the R statistical suite. I've never dealt with SPSS data so cannot speak to their relative merits.
You could have Python make an external call to spssread, a Perl script that outputs the content of SPSS files in the way you want.
Maybe this will help:
Python reader + writer for spss sav files (Linux, Mac & Windows)
http://code.activestate.com/recipes/577811-python-reader-writer-for-spss-sav-files-linux-mac-/
To be clear, the SPSS ODBC driver does not require an SPSS installation.
Maybe this will be helpful for someone:
http://sourceforge.net/search/?q=python+SPSS
good luck!
Michal
Related
I am sorry if this question has already been answered; however, it is a topic that is rarely discussed.
I am trying to run a macro in libreoffice. The macro has been written in python as shown.
import uno, os.path, sys
import pandas

def Bring_from_doc():
    doc = XSCRIPTCONTEXT.getDocument()
    siz = doc.Sheets
uno, os and sys can be imported without any issue, since they are installed in the folder of the Python installation bundled with LibreOffice.
However, pandas is not installed, and I got this error when running the script:
This is the directory where the LibreOffice Python libraries are located, including uno, os and sys. But pandas and the other libraries I want are not there.
My question is: how can pandas and any other required library be installed so that they can be used by any Python script run by LibreOffice in a macro?
Thank you!!
On Linux, this is easy. Simply install the library in the system python and it will work from a LibreOffice macro. Verify whether it is python 2 or 3 on your system; for example on Ubuntu, normally I enter python3 as the executable name.
On Windows, this is nearly impossible and I would not recommend it (and have repeatedly stated this on Stack Overflow, as it is asked here every so often). Numerous environment variables must be set correctly and other hacks are required. If you are an expert then it can be done, but since you are asking here, my guess is that you do not have the required skills to make this process go smoothly!
Even if you do get it to work, no one else will be able to use your macro, and you would need to do it all over again if you use a different computer. So I would not recommend it in most cases even for experts.
Alternatives:
Run Linux in a virtual machine.
Install a normal distribution of python elsewhere on your system, and write a normal python script that imports pandas, saving the result for example to an xml file. When that script is finished, run a LibreOffice macro that reads the result file and does not import pandas.
Avoid using specialized libraries at all. This is what I normally do, as the standard python libraries allow you to do quite a bit, maybe not as easily, but you could import some extra code or write workarounds to do what is needed.
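The second alternative above (an external script that writes a result file, and a macro that only reads it) could look roughly like this. It is a sketch under assumptions: JSON is used for the result file instead of XML because the standard library reads it in one call, the function names are hypothetical, and the external script is assumed to take the result path as its only argument.

```python
import json
import subprocess


def run_external_script(python_exe, script_path, result_path):
    """Run script_path with another interpreter; it must write result_path."""
    # python_exe is the regular Python install that has pandas available.
    subprocess.check_call([python_exe, script_path, result_path])


def load_result(result_path):
    """Read the result file using only the standard library."""
    # This part is safe to call from a LibreOffice macro: no pandas needed.
    with open(result_path) as handle:
        return json.load(handle)
```

The macro side needs only load_result, so it runs under the Python that ships with LibreOffice without any extra packages.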
I know similar questions have been popular in the past, but none addresses my problem. I'm looking for a way to read data from an Excel file in Python, but I'm strongly against using non-builtin modules.
The reason is that in my case Python is a component of another piece of software, so incorporating an additional module would require every user to know how to use pip, which Python installation on their PC to install the module into, etc. The solution must not require any additional action from the user.
I can read CSV files with Python builtin easily, so that could work, but how can I convert Excel to CSV in the first place? Or is there a way to read Excel directly?
Edit: It is Python 2 that is used in this software.
Edit2:
Would anyone mind explaining the downvote? I think this isn't a question asking for a ready-made solution or module, but rather for a method, and it is well detailed. It is not always possible to use external modules, so this is an actual problem. If it is not possible at all, though, then I would simply expect an answer instead of a -1.
Not really the prettiest solution, but you could download the complete code repository of one of the Excel-handling packages for Python (openpyxl, for example) and put these files in the same directory as the Python script that you're going to run. You can then import these local package files in your script.
Note: if the Excel-handling package depends on other packages, then you'll need to download these as well.
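A small sketch of how the bundled copy can be made importable. It assumes the package files were copied into a subfolder named vendor next to the script; the folder name and the helper add_vendor_dir are hypothetical, not part of any library.

```python
import os
import sys


def add_vendor_dir(base_dir, name="vendor"):
    """Put base_dir/name at the front of sys.path and return that path."""
    vendor_dir = os.path.join(base_dir, name)
    if vendor_dir not in sys.path:
        sys.path.insert(0, vendor_dir)
    return vendor_dir


# Typical use at the top of the script (package copied into ./vendor):
# add_vendor_dir(os.path.dirname(os.path.abspath(__file__)))
# import openpyxl
```

Files placed directly beside the script are importable without this step; the subfolder just keeps third-party code separate. The copied release must support the Python version embedded in the host software.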
Let me start off by saying my python knowledge is beginner-to-intermediate level, and I recently started using the language again after a long time.
The Goal:
This morning I came across a bunch of Word documents I wanted to convert and concatenate into PDF files, with two .doc files creating one PDF.
It seemed like a fairly trivial task, so I figured I'd try to learn how to do it in Python.
Concatenating PDFs wasn't too bad: I found PyPDF2 and managed to write a script that did just that.
But 7 hours later, after countless scripts with broken dependencies, I still can't find a way to automate the doc-to-PDF conversion.
The Problem(s):
every script I found either:
uses python-docx (my documents are word 2003 .docs)
uses unoconv bridge (which I installed along with OpenOffice, then searched around for documentation but found none- thus I have no idea how to call from a python script or the shell. I saw one example for this but it keeps throwing errors)
uses win32com or win32com.client or pywin32 or somesuch.
I ran into numerous issues with these: I installed one but couldn't import it from code (as happened to the guy here), and now I can't even find them with pip. I searched for documentation for them (are they modules or classes? I have no idea) and found practically nothing that I could understand, beyond that they're connected to ActivePython (which is apparently a superset of Python with more capabilities?).
Uses comtypes, which I installed but was unable to use/import either for some reason (maybe I'm using pip wrong somehow?)
I know my question is hardly focused but honestly by now my brain is fried from information overload. any simplifications for a noob would be more than welcome.
TL;DR:
assuming no knowledge of COM stuff and little experience with any external frameworks:
what would I have to do to convert Word 2003 .doc files to .pdf files? I'm running python3.5.1 32-bit on a Windows 10 64-bit machine.
where can I learn more about accessing other software APIs from python? are there big prerequisites for this stuff like knowing how the OS works on a lower level?
Thanks!
From my experience, converting between the various office formats is best done outside of python. With the subprocess module, you can call the external command
soffice --convert-to pdf file.doc --headless
where soffice is the command that comes with LibreOffice.
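Called from Python, that command might be wrapped as below. This is a sketch: the function names are hypothetical, and it assumes soffice is on the PATH (on Windows you would likely pass the full path to soffice.exe). The --outdir option tells LibreOffice where to write the PDF.

```python
import os
import subprocess


def build_convert_command(doc_path, out_dir, soffice="soffice"):
    """Return the LibreOffice command line that converts doc_path to PDF."""
    return [
        soffice,
        "--headless",
        "--convert-to", "pdf",
        "--outdir", out_dir,
        doc_path,
    ]


def convert_to_pdf(doc_path, out_dir, soffice="soffice"):
    """Run the conversion and return the expected path of the PDF."""
    subprocess.check_call(build_convert_command(doc_path, out_dir, soffice))
    stem = os.path.splitext(os.path.basename(doc_path))[0]
    return os.path.join(out_dir, stem + ".pdf")
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.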
I know this is possible to do using additional libraries such as win32com or python-pptx, but I was wondering if anyone knew of a way to insert an image into a PowerPoint slide using only the standard library. Lots of googling has indicated that the best solution is probably win32com, but since I cannot guarantee that every system this script will be deployed to will have win32com, I am looking for an implementation leveraging libraries that all systems with a standard Python 2.7 install will have.
It is probably possible to modify a .pptx file with the standard library without much effort: this new generation of files is meant to be zip-compressed XML plus external image files, and can be handled by the zipfile module and standard XML parsers.
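As a starting point, the standard library zipfile module can show what is inside a .pptx. The sketch below only lists the slide XML parts and the embedded media; the helper name is hypothetical, and the ppt/slides/ and ppt/media/ prefixes are the usual layout of such archives rather than something this code verifies.

```python
import zipfile


def list_pptx_parts(pptx_path):
    """Return (slide XML names, media file names) found in a .pptx archive."""
    with zipfile.ZipFile(pptx_path) as archive:
        names = archive.namelist()
    slides = sorted(
        n for n in names
        if n.startswith("ppt/slides/") and n.endswith(".xml")
    )
    media = sorted(n for n in names if n.startswith("ppt/media/"))
    return slides, media
```

Actually inserting a picture takes more than copying a file into the archive: the slide XML and its relationships file also have to reference the new image, which is the part that libraries like python-pptx handle for you.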
Legacy .ppt files, however, are a closed binary format with little documentation and hundreds of corner cases. It would always "be possible" to change them, since they are still just bytes, but it would take considerable effort.
That said, starting with Python 3.4, the package installer pip ships by default with the language install. Probably the best way to go would be to script the installation of external libraries using the built-in pip; that way one would not have to avoid all external library usage.
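One way to script that installation, as a sketch for Python 3.4 or later: try the import first and call pip through the running interpreter only if it fails. The helper name ensure_package is hypothetical, and the approach assumes the target machine has network access and permission to install packages.

```python
import importlib
import subprocess
import sys


def ensure_package(module_name, pip_name=None):
    """Import module_name, installing it with pip first if it is missing."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        # Use the interpreter that is running this script, so the package
        # lands in the same installation that will import it.
        subprocess.check_call(
            [sys.executable, "-m", "pip", "install", pip_name or module_name]
        )
        importlib.invalidate_caches()
        return importlib.import_module(module_name)
```

For python-pptx the two names differ, so the call would be ensure_package("pptx", "python-pptx").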
In the early nineties I bought the tawk (Thompson awk) compiler and have since developed a lot of programs for my companies. The compiler produces fast, reliable code and has a lot of useful extensions for the Windows environment.
Until now it has worked under W95, W2K and XP without problems, but now that I have to move to W7 / 2008 Server I doubt whether it is wise to continue with this excellent but outdated and no longer supported product.
My questions to you:
What can you recommend for real-world business applications (all of them run in batch mode, no GUI)?
Has anyone made a bigger transition (manual reprogramming) from xxx (here: awk) to Python?
Which Python implementation should I use? I need fast file I/O and extensive random access to 100,000+ dictionary elements for 1.5 million monthly transactions.
Which is the most stable version? 2.7.x? 3.1.x?
Does 3.1 support Windows Automation? I have to drive the Excel API through COM and need access to MS-SQL.
And: is Python really the right choice for this kind of task?
Thank you for your honorable answers
Meiki
Python is a good choice for these types of tasks. You should use Python 2.7.2, and since you are on Windows, you may want to use the ActiveState Python distribution http://www.activestate.com/activepython/downloads which is standard Python bundled with a number of additional useful libraries and an easy-to-use package manager named PyPM.
Also, you should have a look at the slide presentations here http://www.dabeaz.com/generators/ and here http://www.dabeaz.com/generators-uk/index.html because Python generators are a powerful way to handle the same types of batch processing that AWK is used for.
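To give a flavour of the generator style, here is a small sketch of an awk-like pipeline: split lines into fields, filter by a pattern, and total one field per key. The function names and the sample record layout are made up for illustration.

```python
from collections import defaultdict


def read_records(lines, sep=None):
    """Yield each non-blank line split into fields, like awk's $1, $2, ..."""
    for line in lines:
        line = line.rstrip("\n")
        if line.strip():
            yield line.split(sep)


def where(records, predicate):
    """Yield only the records matching predicate, like an awk pattern."""
    return (rec for rec in records if predicate(rec))


def sum_by_key(records, key_index, value_index):
    """Total a numeric field per key, like awk's total[$1] += $3."""
    totals = defaultdict(float)
    for rec in records:
        totals[rec[key_index]] += float(rec[value_index])
    return dict(totals)
```

Because each stage is lazy, passing an open file object as lines processes the input one line at a time, so memory use does not grow with file size; only the totals dictionary is kept.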
As for Windows automation, the ActiveState distro for Windows includes this, or you can download and install pywin32 separately if you are using the Python.org distro. I've used Python and COM to extract data from Word documents, Excel spreadsheets, Outlook mailboxes and Lotus Notes databases, among other things.
If you want to stick with the awk style of doing things, you can write some Python helper functions so that your Python programs don't look so foreign to awk eyes. In fact, pyawk.py may already be all that you need: http://pyawk.sourceforge.net/ You can download it here: http://sourceforge.net/projects/pyawk/files/pyawk/pyawk-0.4/ However, be warned that Python has evolved a lot since it was last updated.
Without question this is the best way to add Tawk to Django/python. It solved all my needs.
https://github.com/CleitonDeLima/django-tawkto