SAS and Web Data

SAS and Web Data - python

I have been taking a few graduate classes with a professor I like alot and she raves about SAS all of the time. I "grew up" learning stats using SPSS, and with their recent decisions to integrate their stats engine with R and Python, I find it difficult to muster up the desire to learn anything else. I am not that strong in Python, but I can get by with most tasks that I want to accomplish.
Admittedly, I do see the upside to SAS, but I have learned to do some pretty cool things combining SPSS and Python, like grabbing data from the web and analyzing it real-time. Plus, I really like that I can use the GUI to generate the base for my code before I add my final modifications. In SAS, it looks like I would have to program everything by hand (ignoring Enterprise Guide).
My question is this. Can you grab data from the web and parse it into SAS datasets? This is a deal-breaker for me. What about interfacing with API's like Google Analytics, Twitter, etc? Are there external IDE's that you can use to write and execute SAS programs?
Any help will be greatly appreciated.
Brock

Incidentally, SAS is now offering integration with R.
http://support.sas.com/rnd/app/studio/Rinterface2.html
There are all sorts of ways to get data off the web. One example is to use the url access methods on filename statements to pull in xml data off the web.
For example:
filename cmap "yldmap.map"; /* an xml map I created to parse the data */
filename curyld
url "http://www.ustreas.gov/offices/domestic-finance/debt-management/interest-rate/yield.xml";
libname curyld xml xmlmap=cmap;

yes. sas 9.2 can interact with soap and restful apis. i haven't had much success with twitter. i have had some success with google spreadsheets (in sas 9.1.3) and i've seen code to pull google analytics (in sas 9.2).
as with python and r, you can write the code in any text editor, but you'll need to have sas to actually execute it. lately, i've been bouncing between eclipse, pspad, and sas's enhanced editor for writing code, but i always have to submit in sas.

Related

Running a Python Script on a Website (in the background)

Firstly, apologies for the very basic question. I have looked into other answers but they haven't quite answered what I'm after. I'm confident designing a site in HTML/CSS and have very very basic knowledge of Python.
I want to run a very basic Python script on my website. It analyses tweets about a specific topic, and then posts a sentiment analysis score. I want it to run this sentiment analysis every hour and cache the score.
I have a working Python script which does this in Jupyter Notebook. Could you give me an overview of how I would make this script function online and cache the results? I've read into using Python web frameworks, but from my limited understanding, they seem like overkill?
Thank you for your help!

Could you give me an overview of how I would make this script function online
The key thing would be to uncouple the two parts of your system:
Producing the data
Showing it in a website.
So the first thing to do is have your sentiment-analysis script push its value to a database. The database could be something as simple as a csv file, or it could be a key/value store, or something like MySQL or CouchDB (or hundreds of other choices).
Over on the website you have to make a decision between:
Server-side
Client-side
If the former, you could program in Python if that is what you are most familiar with. Whatever language/framework combination you go for, there will an example tutorial of how to read a value from a database and display it: it is just about the most fundamental thing.
If client-side you will usually be programming in JavaScript. Again you need to choose a framework, but again you should easily be able to find a tutorial to follow.
(Unless you have a good reason to prefer server-side, such as familiarity with an existing framework, or security issues with accessing your database, I'd go with a client-side approach.)
I've read into using Python web frameworks... overkill?
Yes and no. You are going to need some kind of database, and some kind of framework. It would be good to understand the basics of web security, too. If the sentiment analysis is your major goal, all that is going to be a distraction, and it might be better to find a friend who already knows web programming to work with. Or just find a tutorial that is very close to what you want to do, and adapt that.
(P.S. I was going to flag your question as "too broad", but you did ask for an overview, so I hope this helps.)

Automating IBM SPSS Data Collection survey export?

I'm so sorry for the vague question here, but I'm hoping an SPSS expert will be able to help me out here. We have some surveys that are done via SPSS, from which we extract data for an internal report. Right now the process is very cumbersome and requires going to the SPSS Data Collection Interviewer Server Administration page and manually exporting data from two different projects (which takes hours at a time!). We then take that data, massage it, and upload it to another database that drives the internal report.
My question is, does anyone out there know how to automate this process? Is there a SQL Server database behind the SPSS data? Where does the .mdd file come in to play? Can my team (who is well-versed in extracting data from various sources) tap into the SQL Server database behind SPSS to get our data? Or do we need some sort of Python script and plugin?
If I'm missing information that would be helpful in answering the question, please let me know. I'm happy to provide it; I just don't know what to provide.
Thanks so much.

As mentioned by other contributors, there are a few ways to achieve this. The simplest I can suggest is using the DMS (data management script) and windows scheduler. Ideally you should follow below steps.
Prerequisite:
1. You should have access to the server running IBM Data collection
2. Basic knowledge of windows task scheduler
3. Knowledge of DMS scripting
Approach:
1. Create a new DMS script from the template
2. If you want to perform only data extract / transformation, you only need input and output data source
3. In the input data source, create/build the connection string pointing to your survey on IBM Data collection server. Use the data source as SQL
4. In the select query: use "Select * from VDATA" if you want to export all variables
5. Set the output data connection string by selecting the output data format as SPSS (if you want to export it in SPSS)
6. run the script manually and see if the SPSS export is what is expected
7. Create batch file using text editor (save with .bat extension). Add below lines
cd "C:\Program Files\IBM\SPSS\DataCollection\6\DDL\Scripts\Data Management\DMS"
Call DMSRun YOURDMSFILENAME.dms
Then add a line to copy (using XCOPY) the data / files extracted to the location where you want to further process it.
Save the file and open windows scheduler to schedule the execution of this batch file for data extraction.
If you want to do any further processing, you create an mrs or dms file and add to the batch file.
Hope this helps!

There are a number of different ways you can accomplish easing this task and even automate it completely. However, if you are not an IBM SPSS Data Collection expert and don't have access to somebody who is or have the time to become one, I'd suggest getting in touch with some of the consultants who offer services on the platform. Internally IBM doesn't have many skilled SPSS resources available, so they rely heavily on external partners to do services on a lot of their products. This goes for IBM SPSS Data Collection in particular, but is also largely true for SPSS Statistics.
As noted by previous contributors there is an approach using Python for data cleaning, merging and other transformations and then loading that output into your report database. For maintenance reasons I'd probably not suggest this approach. Though you are most likely able to automate the export of data from SPSS Data Collection to a sav file with a simple SPSS Syntax (and an SPSS add-on data component), it is extremely error prone when upgrading either SPSS Statistics or SPSS Data Collection.
From a best practice standpoint, you ought to use the SPSS Data Collection Data Management module. It is very flexible and hardly requires any maintenance on upgrades, because you are working within the same data model framework (e.g. survey metadata, survey versions, labels etc. is handled implicitly) right until you load your transformed data into your reporting database.
Ideally the approach would be to build the mentioned SPSS Data Collection Data Management script and trigger it at the end of each completed interview. In this way your reporting will be close to real-time (you can make it actual real-time by triggering the DM script during the interview using the interview script events - just a FYI).
All scripting on the SPSS Data Collection platform including Data Management scripting is very VB-like, so for most people knowing VB, it is very easy to get started and it is documented very well in the SPSS Data Collection DDL. There you'll also be able to find examples of extracting survey data from SPSS Data Collection surveys (as well as reading and writing data to/from other databases, files etc.). There are also many examples of data manipulation and transformation.
Lastly, to answer your specific questions:
Yes, there is always an MS SQL Server behind SPSS Data Collection -
no exceptions. However, generally speaking the data model is way to
complex to read out data directly from it. If you have a look in it,
you'll quickly realize this.
The MDD file (short for Meta Data Document) is containing all survey meta
data including data source specifications, version history etc.
Without it you'll not be able to make anything of the survey data in
the database, which is the main reason I'd suggest to stay within the
SPSS Data Collection platform for as large part of your data handling
as possible. However, it is indeed just a readable XML file.
Note that the SPSS Data Collection Data Management Module requires a separate license and if the scripting needed is large or complex, you'd probably want base professional too, if that's not what you already use for developing the questionnaires and handling the surveys.
Hope that helps.

This isn't as clean as working directly with whatever database is holding the data, but you could do something with an exported data set:
There may or may not be a way for you to write and run an export script from inside your Admin panel or whatever. If not, you could write a simple Python script using Selenium WebDriver which logs into your admin panel and exports all data to a *.sav data file.
Then you can use the Python SPSS extensions to write your analysis scripts. Note that these scripts have to run on a machine that has a copy of SPSS installed.
Once you have your data and analysis results accessible to Python, you should be able to easily write that to your other database.

How to extract serialised data from Access

I want to enable a user to export some data to a web application I am building. The data from the legacy application can be accessed through MS Acces (ODBC). The web application is written in Django/Python, but that is not very relevant.
The user would have to export data from time to time and import it into the web app. The table structure in the web app more-or-less mirrors the one in the legacy application.
My question of how to get the data from Access to a format that is easily parseable in the web app. The data is from 5 different tables and interrelated. Is there a way to serialise the data from Access into an XML / JSON file? I know that you can do an XML export, but as far as I know that is limited to a query, so I wouldn't have the hierarchy... Is there a VBA library to help with the task?

You can reference Microsoft XML, v5.0 (or whatever version) in the Visual Basic Editor and create XML programmatically.
See
- Simple example
- Introduction to XML in Microsoft Windows (in depth example)

Answering my own question here. I did some googling and it looks like you can export data from a table together with selected other tables. For that, it is necessary to draw the relationships within Access.
That might also solve my problem (and without composing the XML manually). Will find out if this works and check back later.
source: http://msdn.microsoft.com/en-us/library/office/aa167823(v=office.11).aspx#odc_accessnewxmlfeatures_includingrelatedtableswhenexportingxml

Python scripting for XBMC

I am new to programming and to Python itself. I have no programming experience. I have managed to read up on Python and done some fairly basic Python tutorial, now I am ready for my first project in Python.
I am basing my project around XBMC, I want to develop some addons for this awesome media center.
I have a few websites that I want to scrape and display in XBMC. One is a music website and one is a payed TV website which is only available to people with accounts with them. I have managed to scrape a website with feedparse but I have no idea how to output these titles and links to play in XBMC.
My question here is: where do I start, how do I construct the script for these websites, what tools/libraries/modules do I need. And what do I need to do to include it into XBMC.

On the general topic that has been asked a ton of times regarding webpage scraping, the common answer is always Mechanize/Beautiful Soup for python. That would allow you to actually get your data.
Once you have your data, its then just a matter of formatting it the way you want, for your xbmc app: http://wiki.xbmc.org/index.php?title=HOW-TO:Write_Python_Scripts_for_XBMC
Its a two step process.
Get your data from a source and format it into some common structure
Use the common structure to populate your elements in the xbmc script
What you actually want to do with your script will determine how you would use your data. If its just simply providing information, then that link above would pretty much explain it.

Mining Wikipedia for mapping relations for text mining

I am planning to develop a web-based application which could crawl wikipedia for finding relations and store it in a database. By relations, I mean searching for a name say,'Bill Gates' and find his page, download it and pull out the various information from the page and store it in a database. Information may include his date of birth, his company and a few other things. But I need to know if there is any way to find these unique data from the page, so that I could store them in a database. Any specific books or algorithms would be greatly appreciated. Also mentioning of good opensource libraries would be helpful.
Thank You

If you haven't already, you should have a look at DBpedia. Many categories of wiki articles have "Infoboxes" for the kinds of information you describe, and they've made a database out of it:
http://en.wikipedia.org/wiki/DBpedia
You might also leverage some of the information in Metaweb's Freebase (which overlaps and I believe may even integrate the info from DBpedia.) They have an API for querying their graph database, and there's a Python wrapper for it called freebase-python.
UPDATE: Freebase is no more; they were acquired by Google and eventually folded into the Google Knowledge Graph. There is an API but I don't think they have anything like the formal sync'ing Freebase had with public sources like Wikipedia. I'm personally disappointed in how this looks to have turned out. :-/
As for the natural language processing bit, if you do make headway on that problem you might consider these databases as repositories for any information you do mine.

You mention Python and Open Source, so I would investigate the NLTK (Natural Language Toolkit). Text mining and natural language processing is one of those things that you can do a lot with a dumb algorithm (eg. Pattern matching), but if you want to go a step further and do something more sophisticated - ie. Trying to extract information that is stored in a flexible manner or trying to find information that might be interesting but is not known a priori, then natural language processing should be investigated.
NLTK is intended for teaching, so it is a toolkit. This approach suits Python very well. There are a couple of books for it as well. The O'Reilly book is also published online with an open license. See NLTK.org

Jvc, there are existing python modules that can do everything you mentioned above.
For pulling information from webpages, I like to use Selenium, http://seleniumhq.org/projects/ide/. Basically, you can localize and retrieve information on any webpage using a number of identifiers (id, Xpath, etc).
However, like winwaed said, it can be inflexible if you are simply "pattern matching", especially since some websites use dynamic code- meaning the identifiers can change with each subsequent reload of the page. But, this problem can be solved by adding regular expressions, i.e. (.*), to your code. Check out this youtube video, http://www.youtube.com/watch?v=Ap_DlSrT-iE. Even though he is using BeautifulSoup to scrape the website- you can see how he uses regular expressions to pull the information from the page.
Also, I'm not sure what type of database you are working with, but pyodbc, http://code.google.com/p/pyodbc/, can work with SQL types, and also mainstream databases like Microsoft Access.
So, my advice is to look into Selenium for finding the info on the webpage, pyodbc to store and retrieve it, and regular expressions when the identifiers are dynamic.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.