Importing data in SPSS syntax incl. 'value labels' and 'var labels' - python

I am trying to set up a standard workflow to efficiently import data from the Dutch National Bureau of Statistics (http://statline.cbs.nl), published in SPSS syntax, into R and/or Python so I can do analyses, load it into our database, etc.
The good news is that they have standardized a lot of different output formats, amongst others an .sps syntax file. In essence, this is a space-delimited data file with extra information contained in the header and in the footer. The file looks like shown below. I prefer to use this format than plain .csv because it contains more data and should make it easier to import large amounts of data in a consistent manner.
The bad news is that I can't find a working library in Python and/or R that can deal with SPSS .sps syntax files. Most libraries work with the binary .sav or .por formats.
I am not looking for a full working SPSS clone, but for something that will parse the data correctly using the metadata under the keywords 'DATA LIST' (the width of each column), 'VAR LABELS' (the column headers) and 'VALUE LABELS' (extra data that should be joined in or substituted during the import).
I'm sure a Python/R library could be written to parse and process all this info efficiently, but I am not that fluent/experienced in either language to do it myself.
Any suggestions or hints would be helpful.
SET DECIMAL = DOT.
TITLE "Gezondheidsmonitor; regio, 2012, bevolking van 19 jaar of ouder".
DATA LIST RECORDS = 1
/1 Key0 1 - 5 (A)
Key1 7 - 7 (A)
Key2 9 - 14 (A)
Key3 16 - 23 (A)
Key4 25 - 28 (A)
Key5 30 - 33 (A)
Key6 35 - 38 (A)
Key7 40 - 43 (A).
BEGIN DATA
80200 1 GM1680 2012JJ00 . . . .
80200 1 GM0738 2012JJ00 13.2 . . 21.2
80200 1 GM0358 2012JJ00 . . . .
80200 1 GM0197 2012JJ00 13.7 . . 10.8
80200 1 GM0059 2012JJ00 12.4 . . 16.5
80200 1 GM0482 2012JJ00 13.3 . . 14.1
80200 1 GM0613 2012JJ00 11.6 . . 16.2
80200 1 GM0361 2012JJ00 17.0 9.6 17.1 14.9
80200 1 GM0141 2012JJ00 . . . .
80200 1 GM0034 2012JJ00 14.3 18.7 22.5 18.3
80200 1 GM0484 2012JJ00 9.7 . . 15.5
(...)
80200 3 GM0642 2012JJ00 15.6 . . 19.6
80200 3 GM0193 2012JJ00 . . . .
END DATA.
VAR LABELS
Key0 "Leeftijd"/
Key1 "Cijfersoort"/
Key2 "Regio's"/
Key3 "Perioden"/
Key4 "Mantelzorger"/
Key5 "Zwaar belaste mantelzorgers"/
Key6 "Uren mantelzorg per week"/
Key7 "Ernstig overgewicht".
VALUE LABELS
Key0 "80200" "65 jaar of ouder"/
Key1 "1" "Percentages"
"2" "Ondergrens"
"3" "Bovengrens"/
Key2 "GM1680" "Aa en Hunze"
"GM0738" "Aalburg"
"GM0358" "Aalsmeer"
"GM0197" "Aalten"
(...)
"GM1896" "Zwartewaterland"
"GM0642" "Zwijndrecht"
"GM0193" "Zwolle"/
Key3 "2012JJ00" "2012".
LIST /CASES TO 10.
SAVE /OUTFILE "Gezondheidsmonitor__regio,_2012,_bevolking_van_19_jaar_of_ouder.SAV".

Some sample code to get you started - sorry, I'm not the best Python programmer, so any improvements are welcome.
Still to be added is a method to load the labels and create a dict per key for the VALUE LABELS.
f = open('Bevolking_per_maand__100214211711.sps', 'r')
spss_keys = list()
data = list()
num_records = None
begin_data_step = False
end_data_step = False
for l in f:
    # first look for TITLE
    if l.find('TITLE') != -1:
        start_pos = l.find('"') + 1
        end_pos = l.find('"', start_pos + 1)
        title = l[start_pos:end_pos]
        print "title:", title
    if l.find('DATA LIST') != -1:
        start_pos = l.find('=') + 1
        num_records = l[start_pos:].strip()
        print "number of records =", num_records
    if num_records == '1':
        # column definitions, e.g. "/1 Key0 1 - 5 (A)" or "Key1 7 - 7 (A)"
        if l.find('Key') != -1 and not begin_data_step and not end_data_step:
            parts = l.replace('.', '').split()
            if parts[0].startswith('/'):
                parts = parts[1:]  # drop the leading record number "/1"
            # name, start column, end column, format
            spss_keys.append([parts[0], int(parts[1]), int(parts[3]), parts[4]])
        if l.find('END DATA.') != -1:
            end_data_step = True
        if begin_data_step and not end_data_step:
            # slice each data line at the column positions from DATA LIST
            values = list()
            for key in spss_keys:
                values.append(l[key[1] - 1:key[2]])
            data.append(values)
        if l.find('BEGIN DATA') != -1:
            begin_data_step = True
f.close()
# more to follow: VAR LABELS and VALUE LABELS still need parsing
print data
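For the missing label-handling step, here is a rough regex-based sketch of parsing the VAR LABELS and VALUE LABELS blocks. It assumes the layout shown in the sample file above (each block ends with a period at end-of-line, VALUE LABELS groups are separated by '/', and labels contain no '/' or embedded quotes); the real files may need tweaks:

```python
import re

def parse_labels(sps_text):
    """Parse the VAR LABELS and VALUE LABELS blocks out of SPSS syntax text.

    Returns (var_labels, value_labels): var_labels maps a key name to its
    column header; value_labels maps a key name to a dict of code -> label.
    """
    var_labels = {}
    value_labels = {}

    # VAR LABELS block: lines like   Key0 "Leeftijd"/
    m = re.search(r'VAR LABELS(.*?)\.\s*$', sps_text, re.S | re.M)
    if m:
        for key, label in re.findall(r'(\w+)\s+"([^"]*)"', m.group(1)):
            var_labels[key] = label

    # VALUE LABELS block: groups like   Key1 "1" "Percentages" "2" "Ondergrens" /
    m = re.search(r'VALUE LABELS(.*?)\.\s*$', sps_text, re.S | re.M)
    if m:
        for group in m.group(1).split('/'):
            group = group.strip()
            if not group:
                continue
            key = group.split()[0]
            pairs = re.findall(r'"([^"]*)"\s+"([^"]*)"', group)
            value_labels[key] = dict(pairs)

    return var_labels, value_labels
```

The var_labels dict can then serve as column headers for the parsed data, and value_labels can be used to replace codes like GM0738 with their human-readable names during import.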

From my point of view I would not bother with the SPSS file option, but choose the HTML version and scrape it. It looks like the tables are nicely formatted with CSS classes, which would make scraping/parsing the HTML much easier.
Another question to answer: are you going to download the files manually, or would you also like to do that automatically?
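If you go the HTML route, a minimal table scraper can be built with the standard library alone. The snippet below is a sketch; the table markup is a made-up stand-in for the Statline output, and the real pages will have their own class names you would key on (for downloading automatically, urllib can fetch the pages first):

```python
from html.parser import HTMLParser

class TableParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, row by row."""
    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._row = []            # start a new row
        elif tag in ('td', 'th'):
            self._in_cell = True      # collect text until the cell closes

    def handle_endtag(self, tag):
        if tag == 'tr' and self._row is not None:
            self.rows.append(self._row)
            self._row = None
        elif tag in ('td', 'th'):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell and self._row is not None:
            self._row.append(data.strip())

# Hypothetical snippet standing in for a downloaded Statline page:
html = "<table><tr><th>Regio</th><th>Pct</th></tr><tr><td>GM0738</td><td>13.2</td></tr></table>"
parser = TableParser()
parser.feed(html)
print(parser.rows)  # [['Regio', 'Pct'], ['GM0738', '13.2']]
```

For real pages, BeautifulSoup (or pandas.read_html) would save you the boilerplate, but the idea is the same.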

Related

Best way to parse this file to access the tab separated table at the end?

I have about 15,000 text files with the following format (sample to download is below):
https://easyupload.io/res1so
The part I am interested is the table at the end that looks like:
1 1 GLY HA2 H 3.55 . 2
2 1 GLY HA3 H 3.76 . 2
3 2 VAL H H 8.52 . 1
4 2 VAL HA H 4.20 . 1
5 2 VAL HB H 2.02 . 1
I don't have a lot of experience in parsing files, but figured a lot of the people here would. Can I get some advice on how to programmatically extract just this part of the file?
For example, is there a way to read the file only between the lines:
_Chem_shift_ambiguity_code
and
_stop
Would the best approach be to use regular expressions to search each line with the readline() method until I have reached the appropriate part, and then toggle 'on' something that continually appends the lines to a pandas dataframe?
Thank you in advance!
A simple way for this kind of parsing is to use a boolean variable to record whether we are inside the block to process or not and toggle it when we find the keyword:
with open(filename) as fd:
    inblock = False
    for line in fd:
        if inblock:  # we are inside the block here
            if len(line.strip()) == 0:
                continue  # ignore blank lines inside the block
            elif 'stop_' in line:
                inblock = False
                break  # stop processing the file
            else:  # ok, we can process that line
                ...
        else:  # still waiting for the initial keyword
            if '_Chem_shift_ambiguity_code' in line:
                inblock = True  # ok, we have found the beginning of the block
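To get from the toggle to the pandas DataFrame the question asks about, the processed lines can first be collected as split rows. A sketch with made-up sample data (the start/stop marker strings and the column layout are taken from the question and may need adjusting for the real files):

```python
def extract_block(lines, start='_Chem_shift_ambiguity_code', stop='stop_'):
    """Collect whitespace-separated rows between the start and stop markers."""
    rows = []
    inblock = False
    for line in lines:
        if inblock:
            if stop in line:
                break
            if line.strip():          # skip blank lines inside the block
                rows.append(line.split())
        elif start in line:
            inblock = True
    return rows

# Hypothetical stand-in for one of the 15,000 files:
sample = """header text
_Chem_shift_ambiguity_code
1 1 GLY HA2 H 3.55 . 2
2 1 GLY HA3 H 3.76 . 2
stop_
trailer text
""".splitlines()

rows = extract_block(sample)
print(rows[0])  # ['1', '1', 'GLY', 'HA2', 'H', '3.55', '.', '2']
# rows can then be handed to pandas, e.g. pd.DataFrame(rows, columns=[...])
```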

HTML parsing using beautiful soup gives structure different to website

When I view this link https://www.cftc.gov/sites/default/files/files/dea/cotarchives/2015/futures/financial_lf061615.htm the text is displayed in a clear way. However, when I try to parse the page using Beautiful Soup, the output doesn't look the same - it is all messed up. Here is the code:
import urllib.request
from bs4 import BeautifulSoup
request = urllib.request.Request('https://www.cftc.gov/sites/default/files/files/dea/cotarchives/2015/futures/financial_lf061615.htm')
htm = urllib.request.urlopen(request).read()
soup = BeautifulSoup(htm,'html.parser')
text = soup.get_text()
print(text)
The desired output would look like this:
-----------------------------------------------------------------------------------------------------------------------------------------------------------
Traders in Financial Futures - Futures Only Positions as of June 16, 2015
-----------------------------------------------------------------------------------------------------------------------------------------------------------
Dealer : Asset Manager/ : Leveraged : Other : Nonreportable :
Intermediary : Institutional : Funds : Reportables : Positions :
Long : Short : Spreading: Long : Short : Spreading: Long : Short : Spreading: Long : Short : Spreading: Long : Short :
-----------------------------------------------------------------------------------------------------------------------------------------------------------
DOW JONES UBS EXCESS RETURN - CHICAGO BOARD OF TRADE ($100 X INDEX)
CFTC Code #221602 Open Interest is 19,721
Positions
97 2,934 0 8,941 1,574 973 6,490 11,975 1,694 1,372 539 0 154 32
Changes from: June 9, 2015 Total Change is: 3,505
48 0 0 2,013 1,141 70 447 1,369 923 -64 0 0 68 2
Percent of Open Interest Represented by Each Category of Trader
0.5 14.9 0.0 45.3 8.0 4.9 32.9 60.7 8.6 7.0 2.7 0.0 0.8 0.2
Number of Traders in Each Category Total Traders: 31
. . 0 5 . . 6 9 . 5 . 0
-----------------------------------------------------------------------------------------------------------------------------------------------------------
After viewing the page source it is not clear to me how a new line is being distinguished in the style - which is where I think the problem comes from.
Is there some type of structure I need to specify in the BeautifulSoup function? I'm very lost here, so any help is much appreciated.
FWIW I have tried installing the html2text module and had no luck on Anaconda using !conda config --append channels conda-forge and !conda install html2text
Cheers
EDIT: I've figured it out.
request = urllib.request.Request('https://www.cftc.gov/sites/default/files/files/dea/cotarchives/2015/futures/financial_lf061615.htm')
htm = urllib.request.urlopen(request).read()
htm = htm.decode('windows-1252')
htm = htm.replace('\n', '').replace('\r', '')
htm = htm.split('</pre><pre>')
cleaned = []
for i in htm:
    i = BeautifulSoup(i, 'html.parser').get_text()
    cleaned.append(i)
with open('trouble.txt', 'w') as f:
    for line in cleaned:
        f.write('%s\n' % line)

Adding in-between columns, skipping and keeping some rows/columns

I am new to programming but I have started looking into both Python and Perl.
I am looking at data in two input files that are partly CSV-like, selecting some of it and putting it into a new output file.
Maybe Python CSV or Pandas can help here, but I'm a bit stuck when it comes to skipping/keeping rows and columns.
Also, I don't have any headers for my columns.
Input file 1:
-- Some comments
KW1
'Z1' 'F' 30 26 'S'
KW2
'Z1' 30 26 1 1 5 7 /
'Z1' 30 26 2 2 6 8 /
'Z1' 29 27 4 4 12 13 /
Input file 2:
-- Some comments
-- Some more comments
KW1
'Z2' 'F' 40 45 'S'
KW2
'Z2' 40 45 1 1 10 10 /
'Z2' 41 45 2 2 14 15 /
'Z2' 41 46 4 4 16 17 /
Desired output file:
KW_NEW
'Z_NEW' 1000 30 26 1 /
'Z_NEW' 1000 30 26 2 /
'Z_NEW' 1000 29 27 4 /
'Z_NEW' 1000 40 45 1 /
'Z_NEW' 1000 41 45 2 /
'Z_NEW' 1000 41 46 4 /
So what I want to do is:
Do not include anything in either of my two input files before I reach KW2
Replace KW2 with KW_NEW
Replace either 'Z1' or 'Z2' with 'Z_NEW' in the first column
Add a new second column with a constant value e.g. 1000
Copy the next three columns as they are
Leave out any remaining columns before printing the slash / at the end
Could anyone give me at least some general hints/tips how to approach this?
Your files are not "partly csv" (there is not a comma in sight); they are (partly) space delimited. You can read the files line-by-line, use Python's .split() method to convert the relevant strings into lists of substrings, and then re-arrange the pieces as you please. The splitting and re-assembly might look something like this:
input_line = "'Z1' 30 26 1 1 5 7 /" # test data
input_items = input_line.split()
output_items = ["'Z_NEW'", '1000']
output_items.append(input_items[1])
output_items.append(input_items[2])
output_items.append(input_items[3])
output_items.append('/')
output_line = ' '.join(output_items)
print(output_line)
The final print() statement shows that the resulting string is
'Z_NEW' 1000 30 26 1 /
Is your file format static? (This is not actually CSV, by the way. :P) You might want to investigate a standardized file format like JSON or strict CSV to store your data, so that you can use already-existing tools to parse your input files. Python has great JSON and CSV libraries that can do all the hard stuff for you.
If you're stuck with this file format, I would try something along these lines.
path = '<input_path>'
kws = ['KW1', 'KW2']
desired_kw = kws[1]

def parse_columns(line):
    array = line.split()
    if array and array[-1] == '/':
        # get rid of the trailing slash
        array = array[:-1]
    return array

def is_kw(cols):
    if len(cols) > 0 and cols[0] in kws:
        return cols[0]

# parse the section denoted by the desired keyword
with open(path, 'r') as input_fp:
    matrix = []
    reading_file = False
    for line in input_fp.readlines():
        cols = parse_columns(line)
        line_is_kw = is_kw(cols)
        if line_is_kw:
            if reading_file:
                break  # reached the next keyword, stop
            if line_is_kw == desired_kw:
                reading_file = True
            continue  # never collect the keyword line itself
        if reading_file:
            matrix.append(cols)
print(matrix)
From there you can use stuff like slice notation and basic list manipulation to get your desired array. Good luck!
Here is a way to do it with Perl:
#!/usr/bin/perl
use strict;
use warnings;

# initialize output array
my @output = ('KW_NEW');

# process the first file
open my $fh1, '<', 'in1.txt' or die "unable to open file1: $!";
while (<$fh1>) {
    # consider only lines after KW2
    if (/KW2/ .. eof) {
        # don't treat the KW2 line itself
        next if /KW2/;
        # split the current line on whitespace and keep only the first four elements
        my @l = (split ' ', $_)[0..3];
        # change the first element
        $l[0] = "'Z_NEW'";
        # insert 1000 at the second position
        splice @l, 1, 0, 1000;
        # push into the output array, restoring the trailing slash
        push @output, "@l /";
    }
}

# process the second file
open my $fh2, '<', 'in2.txt' or die "unable to open file2: $!";
while (<$fh2>) {
    if (/KW2/ .. eof) {
        next if /KW2/;
        my @l = (split ' ', $_)[0..3];
        $l[0] = "'Z_NEW'";
        splice @l, 1, 0, 1000;
        push @output, "@l /";
    }
}

# write the array to the output file
open my $fh3, '>', 'out.txt' or die "unable to open file3: $!";
print $fh3 "$_\n" for @output;

python print particular lines from file

The background:
Table$Gene=Gene1
time n.risk n.event survival std.err lower 95% CI upper 95% CI
0 2872 208 0.928 0.00484 0.918 0.937
1 2664 304 0.822 0.00714 0.808 0.836
2 2360 104 0.786 0.00766 0.771 0.801
3 2256 48 0.769 0.00787 0.754 0.784
4 2208 40 0.755 0.00803 0.739 0.771
5 2256 48 0.769 0.00787 0.754 0.784
6 2208 40 0.755 0.00803 0.739 0.771
Table$Gene=Gene2
time n.risk n.event survival std.err lower 95% CI upper 95% CI
0 2872 208 0.938 0.00484 0.918 0.937
1 2664 304 0.822 0.00714 0.808 0.836
2 2360 104 0.786 0.00766 0.771 0.801
3 2256 48 0.769 0.00787 0.754 0.784
4 1000 40 0.744 0.00803 0.739 0.774
#There is a new line ("\n") here too, it just doesn't come out in the code.
What I want seems simple. I want to turn the above file into an output that looks like this:
Gene1 0.755
Gene2 0.744
i.e. each gene, and the last number in the survival column from each section.
I have tried multiple ways, using regular expression, reading the file in as a list and saying ".next()". One example of code that I have tried:
fileopen = open(sys.argv[1]).readlines() # Read in the file as a list.
for index, line in enumerate(fileopen): # Enumerate items in the list
    if "Table" in line: # Find the items with "Table" (this will have my gene name)
        line2 = line.split("=")[1] # Parse the line to get my gene name
        if "\n" in fileopen[index+1]: # This is the problem section.
            print fileopen[index]
        else:
            fileopen[index+1]
So as you can see in the problem section, I was trying to say in this attempt:
if the next item in the list is a new line, print the item, else, the next line is the current line (and then I can split the line to pull out the particular number I want).
If anyone could correct the code so I can see what I did wrong I'd appreciate it.
Bit of overkill, but instead of manually writing a parser for each data item, use an existing package like pandas to read in the file. You just need a bit of code to specify the relevant lines. Un-optimized code (it reads the file twice):
import pandas as pd

def genetable(gene):
    l = open('gene.txt').readlines()
    l.append("\n")  # add a newline at the end in case the last line is not one
    lines = len(l)
    skiprows = -1
    for i, line in enumerate(l):
        if "Table$Gene=Gene" + str(gene) in line:
            skiprows = i + 1
        if skiprows >= 0 and line == "\n":
            skipfooter = lines - i - 1
            # assuming tab-separated data given your inputs; change as needed
            # assert df.columns.....
            df = pd.read_csv('gene.txt', sep='\t', engine='python',
                             skiprows=skiprows, skipfooter=skipfooter)
            return df
    return "Not Found"
This will read in a DataFrame with all the relevant data from that file.
You can then do:
genetable(2).survival # series with all survival rates
genetable(2).survival.iloc[-1]  # last item in survival
The advantage of this approach is that you have access to all the items, and any malformed parts of the file will probably be picked up, preventing incorrect values from being used. If it were my own code I would add assertions on the column names before returning the pandas DataFrame; you want to pick up any parsing errors early so they do not propagate.
This worked when I tried it:
gene = 1
for i in range(len(filelines)):
    if filelines[i].strip() == "":
        print("Gene" + str(gene) + " " + filelines[i-1].split()[3])
        gene += 1
You could try something like this (I copied your data into foo.dat);
In [1]: with open('foo.dat') as input:
...: lines = input.readlines()
...:
Using with makes sure the file is closed after reading.
In [3]: lines = [ln.strip() for ln in lines]
This gets rid of extra whitespace.
In [5]: startgenes = [n for n, ln in enumerate(lines) if ln.startswith("Table")]
In [6]: startgenes
Out[6]: [0, 10]
In [7]: emptylines = [n for n, ln in enumerate(lines) if len(ln) == 0]
In [8]: emptylines
Out[8]: [9, 17]
Using emptylines relies on the fact that the records are separated by lines containing only whitespace.
In [9]: lastlines = [n-1 for n, ln in enumerate(lines) if len(ln) == 0]
In [10]: for first, last in zip(startgenes, lastlines):
   ....:     gene = lines[first].split("=")[1]
   ....:     num = lines[last].split()[3]  # survival is the fourth column
   ....:     print gene, num
   ....:
Gene1 0.755
Gene2 0.744
Here is my solution:
>>> with open('t.txt', 'r') as f:
...     for l in f:
...         if "Table" in l:
...             gene = l.split("=")[1][:-1]
...         elif l not in ['\n', '\r\n']:
...             surv = l.split()[3]
...         else:
...             print gene, surv
...
Gene1 0.755
Gene2 0.744
Instead of checking for a new line, simply print when you are done reading the file:
lines = open("testgenes.txt").readlines()
table = ""
finalsurvival = 0.0
for line in lines:
    if "Table" in line:
        if table != "":  # print previous survival
            print table, finalsurvival
        table = line.strip().split('=')[1]
    else:
        try:
            finalsurvival = line.split('\t')[4]
        except IndexError:
            continue
print table, finalsurvival

DTML: How to prevent formatting loss

I have a DTML document which only contains:
<dtml-var public_blast_results>
and displays, when I view it, as:
YP_001336283 100.00 345 0 0 23 367 23 367 0.0 688
When I edit the DTML page for example just adding a header like:
<h3>Header</h3>
<dtml-var public_blast_results>
The "public_blast_results" output loses its formatting and displays as:
Header
YP_001336283 100.00 345 0 0 23 367 23 367 0.0 688
Is there a way to maintain the formatting? public_blast_results is a Python function which simply reads the contents of a file and returns it.
This has nothing to do with DTML - it's a basic property of HTML, which collapses whitespace. If you want to preserve it, you need to wrap the content in <pre>.
<pre><dtml-var public_blast_results></pre>
