I'm trying to read a sas7bdat file from SAS (product of the SAS Institute) into Python.
Yes, I'm aware that we could export to *.csv files, but I'm trying to avoid that, as it would double the number of files we need to create.
There's good documentation for doing this in Visual Basic. Still, I want it in Python. For example, in VB you could write...
Dim cn As New ADODB.Connection
Dim rs As New ADODB.Recordset

cn.Provider = "sas.LocalProvider"
cn.Properties("Data Source") = "c:\MySasData"
cn.Open

rs.Open "work.a", cn, adOpenStatic, adLockReadOnly, adCmdTableDirect
to open your dataset.
But I can't crack the nut to make this work in Python.
I can type...
import adodbapi
cnstr = 'Provider=sas.LocalProvider;Data Source=c:\\MySasData'
cn = adodbapi.connect(cnstr)
And I can get a cursor...
cur = cn.cursor()
But beyond that, I'm stumped. I did find a cur.rs, which sounds like a recordset, but it is an object with a type of None.
Also, to preempt some alternative methods...
I do not want to create *.csv files in SAS.
The computer with Python does not have SAS installed, but does have the Providers for OLE DB installed. I know for a fact that the VB code I provided works without SAS in read-only mode. You can download these drivers here: http://support.sas.com/downloads/browse.htm?cat=64
I am not an expert in SAS. Honestly, I find their tool cumbersome, confusingly documented, and slow. I noticed that there are some other products listed called "IOMProvider" and "SAS/SHARE". If there's an easier way of doing this using those ADO providers, feel free to document it. However, what I'm really looking for is a way of doing this entirely within Python with a relatively simple bit of code.
Oh, and I'm aware of Python's sas7bdat package, but we're using Python 3.3.5 and it doesn't seem to be compatible. Also, I couldn't figure out how to use it on 2.7 anyway, as there's not a lot of documentation, and even a question on how to use the tool is, to this day, unanswered: Python sas7bdat module usage
Thanks!
I didn't test it with SAS, as I don't have the provider installed currently, but it should go like this:
cn = adodbapi.connect(cnstr)

# print table names in current db
for table in cn.get_table_names():
    print(table)

with cn.cursor() as c:
    # run an SQL statement on the cursor
    sql = 'select * from your_table'
    c.execute(sql)

    # get the results
    db = c.fetchmany(5)

    # print them
    for rec in db:
        print(rec)

cn.close()
EDIT:
Just found this http://support.sas.com/kb/30/795.html, so you might need to use another provider for this method; have a look at the IOM provider (https://www.connectionstrings.com/sas-iom-provider/ , http://support.sas.com/documentation/tools/oledb/gs_iom_tasks.htm).
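If you do end up needing the IOM provider, a minimal, untested sketch would be the same adodbapi pattern with a different connection string; the Data Source value below is only a placeholder - the exact keys depend on your SAS setup and are described in the two links above.

import adodbapi

# placeholder connection string; see the IOM provider links above for the real keys
cnstr = 'Provider=sas.IOMProvider;Data Source=<your IOM server or workspace>'
cn = adodbapi.connect(cnstr)

# from here on, the table/cursor code above should work unchanged
for table in cn.get_table_names():
    print(table)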
Related
I want to import a CSV file into Cassandra using a Python script. I don't know how.
If you're looking for a simple solution, you could always use cqlsh's COPY utility.
> COPY myTable (col1, col2, col3, col4) FROM 'temp.csv' WITH HEADER=true;
I'd go with either COPY or DSBulk before building something new in Python. In fact, cqlsh uses the Python driver and is already built to handle things like paging, batch sizes, timeouts, etc.
Documentation: COPY FROM
Edit 20210903
If you're set on querying w/ CQL and processing a result set in Python, you'll want to do something like this...
The import section will look something like this:
from cassandra.cluster import Cluster
from cassandra.auth import PlainTextAuthProvider
from cassandra.query import SimpleStatement
First establish your connection:
auth_provider = PlainTextAuthProvider(username=username, password=password)
cluster = Cluster(nodes,auth_provider=auth_provider)
session = cluster.connect()
Then build your query as a SimpleStatement.
strCQL = f"SELECT * FROM {keyspace}.{table}"
print(strCQL)
statement = SimpleStatement(strCQL,fetch_size=100)
rows = session.execute(statement)
for row in rows:
    print(row)
Note that you can also print individual column values by their ordinal index on row (row[0], row[1], etc.).
In the above example, I'm setting the fetch size to 100. It defaults to 5000, but if the result set is large, you'll want that to be smaller to avoid timeouts.
Link to my Git repo for reference.
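And if you do want to push the CSV in through the driver instead of COPY or DSBulk, a minimal sketch could look like the following; it reuses the session from above, and the column names (col1, col2) and file name are made up for illustration.

import csv

# prepared INSERT; prepared statements in the Python driver use ? placeholders
insert_cql = f"INSERT INTO {keyspace}.{table} (col1, col2) VALUES (?, ?)"
prepared = session.prepare(insert_cql)

with open('temp.csv', newline='') as f:
    reader = csv.reader(f)
    next(reader)                      # skip the header row (remove if your file has none)
    for col1, col2 in reader:
        session.execute(prepared, (col1, col2))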
You can use the DataStax Bulk Loader tool (DSBulk) to bulk load data in CSV format to a Cassandra table.
Here are some references with examples to help you get started quickly:
Blog - DSBulk Intro + Loading data
Blog - More DSBulk Loading examples
Blog - Counting records with DSBulk
Docs - Loading data examples
Answered questions - DS Community
DSBulk is open-source so it's free to use. Cheers!
TL;DR version - I need to programmatically add a password to .docx/.xlsx/.pptx files using LibreOffice, and it doesn't work; no errors are reported back either, my request to add a password is simply ignored, and a password-less version of the same file is saved.
In-depth:
I'm trying to script the ability to password-protect existing .docx/.xlsx/.pptx files using LibreOffice.
I'm using 64-bit LibreOffice 6.2.5.2 which is the latest version at the time of writing, on Windows 8.1 64-bit Professional.
Whilst I can do this manually via the UI - specifically, I open the "plain" document, do "Save As", tick "Save with Password", and enter the password there - I cannot get this to work via any kind of automation. I've been trying via Python/Uno, but to no avail. Although the code below correctly opens and saves the document, my attempt to add a password is completely ignored. Curiously, the file size shrinks from 12kb to 9kb when I do this.
Here is my code:
import socket
import uno
import sys
localContext = uno.getComponentContext()
resolver = localContext.ServiceManager.createInstanceWithContext("com.sun.star.bridge.UnoUrlResolver", localContext)

# connect to a LibreOffice instance already listening on port 2002
ctx = resolver.resolve("uno:socket,host=localhost,port=2002;urp;StarOffice.ComponentContext")
smgr = ctx.ServiceManager
desktop = smgr.createInstanceWithContext("com.sun.star.frame.Desktop", ctx)

from com.sun.star.beans import PropertyValue

# open the unprotected source document
properties = []
oDocB = desktop.loadComponentFromURL("file:///C:/Docs/PlainDoc.docx", "_blank", 0, tuple(properties))

# build the store properties: keep the Word filter and request a password
sp = []
sp1 = PropertyValue()
sp1.Name = 'FilterName'
sp1.Value = 'MS Word 2007 XML'
sp.append(sp1)

sp2 = PropertyValue()
sp2.Name = 'Password'
sp2.Value = 'secret'
sp.append(sp2)

# save a copy and close the document
oDocB.storeToURL("file:///C:/Docs/PasswordDoc.docx", tuple(sp))
oDocB.dispose()
I've had great results using Python/Uno to open password-protected files, but I cannot get it to protect a previously unprotected document. I've tried enabling the macro recorder and recording my actions - it recorded the following LibreOffice BASIC code:
sub SaveDoc
rem ----------------------------------------------------------------------
rem define variables
dim document as object
dim dispatcher as object
rem ----------------------------------------------------------------------
rem get access to the document
document = ThisComponent.CurrentController.Frame
dispatcher = createUnoService("com.sun.star.frame.DispatchHelper")
rem ----------------------------------------------------------------------
dim args1(2) as new com.sun.star.beans.PropertyValue
args1(0).Name = "URL"
args1(0).Value = "file:///C:/Docs/PasswordDoc.docx"
args1(1).Name = "FilterName"
args1(1).Value = "MS Word 2007 XML"
args1(2).Name = "EncryptionData"
args1(2).Value = Array(Array("OOXPassword","secret"))
dispatcher.executeDispatch(document, ".uno:SaveAs", "", 0, args1())
end sub
Even when I try to run that, it saves an unprotected document, with no password encryption. I've even tried converting the macro above into the equivalent Python code, but to no avail either. I don't get any errors; it simply doesn't protect the document.
Finally, out of desperation, I've even tried other approaches that don't include LibreOffice, for example, using the Apache POI library as per the following existing StackOverflow question:
Python or LibreOffice Save xlsx file encrypted with password
...but I just get an error saying "Error: Could not find or load main class org.python.util.jython". I've tried upgrading my JDK, tweaking the paths used in the example, i.e. had an "intelligent" go, but still no joy. I suspect the error above is trivial to fix, but I'm not a Java developer and lack the experience in this area.
Does anyone have any solution? Do you have some LibreOffice code that can do this (password-protect .docx/.xlsx/.pptx files)? Or OpenOffice for that matter, I'm not precious about which package I use. Or something else entirely!
NOTE: I appreciate this is trivial using full-fat Microsoft Office, but thanks to Microsoft's licensing restrictions, that is a complete no-go for this project - I have to use an alternative.
The following example is from page 40 (file page 56) of Useful Macro Information For OpenOffice.org by Andrew Pitonyak (http://www.pitonyak.org/AndrewMacro.odt). The document is directed at OpenOffice.org Basic but is generally applicable to LibreOffice as well. The example differs from the macro recorder version primarily in its use of the documented API rather than dispatch calls.
5.8.3. Save a document with a password
To save a document with a password, you must set the "Password" attribute.
Listing 5.19: Save a document using a password.
Sub SaveDocumentWithPassword
    Dim args(0) As New com.sun.star.beans.PropertyValue
    Dim sURL$
    args(0).Name = "Password"
    args(0).Value = "test"
    sURL = ConvertToURL("/andrew0/home/andy/test.odt")
    ThisComponent.storeToURL(sURL, args())
End Sub
The argument name is case sensitive, so “password” will not work.
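A sketch of the same listing translated to the Python/Uno script from the question (untested here; it reuses the oDocB document object from the question and, like Listing 5.19, passes only the Password property; the target path is a placeholder):

from com.sun.star.beans import PropertyValue

# the single "Password" property described in Listing 5.19 (the name is case sensitive)
pw = PropertyValue()
pw.Name = 'Password'
pw.Value = 'secret'

# store the already-loaded document with the password set
oDocB.storeToURL('file:///C:/Docs/PasswordDoc.odt', (pw,))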
Question 1 of 2
I'm trying to import data from CSV file to Vertica using Python, using Uber's vertica-python package. The problem is that whitespace-only data elements are being loaded into Vertica as NULLs; I want only empty data elements to be loaded in as NULLs, and non-empty whitespace data elements to be loaded in as whitespace instead.
For example, the following two rows of a CSV file are both loaded into the database as ('1','abc',NULL,NULL), whereas I want the second one to be loaded as ('1','abc',' ',NULL).
1,abc,,^M
1,abc, ,^M
Here is the code:
# import vertica-python package by Uber
# source: https://github.com/uber/vertica-python
import vertica_python
import csv

# write CSV file
filename = 'temp.csv'
data = <list of lists, e.g. [[1,'abc',None,'def'],[2,'b','c','d']]>

with open(filename, 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f, escapechar='\\', doublequote=False)
    writer.writerows(data)
# define query
q = "copy <table_name> (<column_names>) from stdin "\
    "delimiter ',' "\
    "enclosed by '\"' "\
    "record terminator E'\\r' "

# copy data
conn = vertica_python.connect( host=<host>,
                               port=<port>,
                               user=<user>,
                               password=<password>,
                               database=<database>,
                               charset='utf8' )
cur = conn.cursor()
with open(filename, 'rb') as f:
    cur.copy(q, f)
conn.close()
Question 2 of 2
Are there any other issues (e.g. character encoding) I have to watch out for using this method of loading data into Vertica? Are there any other mistakes in the code? I'm not 100% convinced it will work on all platforms (currently running on Linux; there may be record terminator issues on other platforms, for example). Any recommendations to make this code more robust would be greatly appreciated.
In addition, are there alternative methods of bulk inserting data into Vertica from Python, such as loading objects directly from Python instead of having to write them to CSV files first, without sacrificing speed? The data volume is large and the insert job as is takes a couple of hours to run.
Thank you in advance for any help you can provide!
The copy statement you have should perform the way you want with regards to the spaces. I tested it using a very similar COPY.
Edit: I missed what you were really asking with the copy; I'll leave this part in because it might still be useful for some people:
To fix the whitespace, you can change your copy statement:
copy <table_name> (FIELD1, FIELD2, MYFIELD3 FILLER VARCHAR(50), FIELD4, FIELD3 AS NVL(MYFIELD3,'') ) from stdin
By using FILLER, the loader parses that column into something like a variable, which you can then assign to your actual table field using AS later in the copy.
As for any gotchas... I do what you have on Solaris often. The only thing I noticed is that you are setting the record terminator; I'm not sure if this is really something you need to do, depending on the environment or not. I've never had to do it switching between Linux, Windows and Solaris.
Also, one hint: the copy will return a result set that tells you how many rows were loaded. Do a fetchone() and print it out and you'll see it.
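For example, tacked onto the question's code (a small sketch, reusing the cur, q, and filename from above):

with open(filename, 'rb') as f:
    cur.copy(q, f)
loaded = cur.fetchone()   # the COPY reports how many rows were loaded
print(loaded)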
The only other thing I can recommend might be to use reject tables in case any rows reject.
You mentioned that it is a large job. You may need to increase your read timeout by adding 'read_timeout': 7200 (or more) to your connection options. I'm not sure whether None would disable the read timeout or not.
As for a faster way... if the file is accessible directly on the Vertica node itself, you could just reference the file directly in the copy instead of doing a copy from stdin, and have the daemon load it directly. It's much faster and has a number of optimizations that you can do. You could then use apportioned load, and if you have multiple files to load you can just reference them all together in a list of files.
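A rough sketch of that variant, assuming the CSV already sits on the Vertica node at a hypothetical path (the path, table, and column names are placeholders):

# the daemon reads the file directly from the node's filesystem
q_direct = "copy <table_name> (<column_names>) "\
           "from '/data/temp.csv' "\
           "delimiter ',' "\
           "enclosed by '\"' "
cur = conn.cursor()
cur.execute(q_direct)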
It's kind of a long topic, though. If you have any specific questions let me know.
I tried researching the answer but was not able to find a good solution. I have files with a strange extension, .res. I was told that they are MS Access files. I'm not sure if they are the same as .mdb, but I was able to open them in MS Access. How can I open those files, extract the necessary data, sort that data, and produce a .csv file? I tried using this script: http://mazamascience.com/WorkingWithData/?p=168 and mdbtools on Linux. I got some output with errors in the terminal, but all the files produced were blank. It could be due to encoding; I am not sure. The file is in ASCII encoding, I think.
Error: Table fo_Table
Smart_Battery_Data_Table
MCell_Aci_Data_Table
Aux_Global_Data_Table
Smart_Battery_Clock_Stretch_Table
does not exist in this database.
On Windows I have no idea how to do it. My first step for now is just to dump the necessary table from that database file into a .csv. But ideally I need the script to take the file, sort it, extract the necessary data, do some calculations (like data in one column divided by data in another column), and save all that into a nice .csv.
Thanks a lot. I am not an experienced programmer so please have mercy.
Using the generic pyodbc library should do it. It looks like the MS Access ODBC driver it needs is already available on Windows. This question can probably help you out.
I don't have any MS Access database files with me (it has been ages since I had to work with them), but following the examples, your code should be something like this:
import pyodbc

db_file = r'''/path/to/the/file.res'''
user = 'admin'
password = 'password'
odbc_conn_str = 'DRIVER={Microsoft Access Driver (*.mdb)};DBQ=%s;UID=%s;PWD=%s' % (db_file, user, password)

conn = pyodbc.connect(odbc_conn_str)
cursor = conn.cursor()
cursor.execute("select * from table order by some_column")
for row in cursor.fetchall():
    print ", ".join((row.column1, row.column2, row.columnN))
In a project, I need to extract data from a Visual FoxPro database, which is stored in dbf files, and I have a data directory with 539 files I need to take into account; each file represents a database table. I've been doing some testing and my code goes like this:
import pyodbc

connection = pyodbc.connect("Driver={Microsoft Visual FoxPro Driver};SourceType=DBF;SourceDB=P:\\Data;Exclusive=No;Collate=Machine;NULL=No;DELETED=Yes")
tables = connection.cursor().tables()
for _ in tables:
    print _
This prints only 15 tables, with no obvious pattern - always the same 15 tables. I thought this was because the rest of the tables were empty, but I checked and some of the tables (dbf files) on the list are empty too. Then I thought it was a permission issue, but all the files have the same permission structure, so I don't know what's happening here.
Any light??
EDIT:
It is not truncating the output; the tables it lists are not the first 15 or anything like that.
I DID IT!!!!
There were several problems with what I was doing, so here is what I did to solve it (after first implementing it with Ethan Furman's solution).
The first problem was a driver problem: it turns out that the Windows DBF drivers are 32-bit programs running on a 64-bit operating system, and I had installed Python-amd64, so I installed a 32-bit Python instead.
The second issue was a library/file issue: according to this, dbf files in VFP > 7 are different, so my pyodbc library won't read them correctly. I tried some OLE DB libraries with no success and decided to do it from scratch.
Googling for a while took me to this post, which finally shed some light on this.
Basically, what I did was the following:
import win32com.client

conn = win32com.client.Dispatch('ADODB.Connection')
db = 'C:\\Profit\\profit_a\\ARMM'
dsn = 'Provider=VFPOLEDB.1;Data Source=%s' % db
conn.Open(dsn)

cmd = win32com.client.Dispatch('ADODB.Command')
cmd.ActiveConnection = conn
cmd.CommandText = "Select * from factura, reng_fac where factura.fact_num = reng_fac.fact_num AND factura.fact_num = 6099;"

rs, total = cmd.Execute()  # This returns a tuple: (<RecordSet>, number_of_records)
while total:
    for x in xrange(rs.Fields.Count):
        print '%s --> %s' % (rs.Fields.item(x).Name, rs.Fields.item(x).Value)
    rs.MoveNext()  # advance to the next record
    total = total - 1
And it gave me 20 records, which I checked with DBFCommander, and they were OK.
First, you need to install the pywin32 extensions (32-bit) and the Visual FoxPro OLE DB Provider (only available for 32-bit); in my case, for VFP 9.0.
Also, it's good to read the ADO documentation at the w3c website.
This worked for me. Thank you very much to those who replied
I would use my own dbf package and the code would go something like this:
import dbf
from glob import glob
for dbf_file in glob(r'p:\data\*.dbf'):
    with dbf.Table(dbf_file) as table:
        for record in table:
            do_something_with(record)
A table is list-like, and iteration through it returns records. A record is list-, dict-, and obj-like, and iteration returns the values; besides iteration through the record, individual fields can be accessed either by offset (record[0] for the first field), by field-name using dict-like access (record['some_field']), or by field-name using obj.attr-like access (record.some_field).
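Putting those three access styles side by side (a tiny sketch; the file name and field name are placeholders):

with dbf.Table(r'p:\data\example.dbf') as table:
    record = table[0]             # first record, since a table is list-like
    print(record[0])              # first field, by offset
    print(record['some_field'])   # same field, dict-style
    print(record.some_field)      # same field, attribute-style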
If you just wanted to dump the contents of each dbf file into a csv file you could do:
for dbf_file in glob(r'p:\data\*.dbf'):
    with dbf.Table(dbf_file) as table:
        dbf.export(table, dbf_file)
I know this doesn't directly answer your question, but it might still help. I've had lots of issues using ODBC with VFP databases, and I've found it's often much easier to treat the VFP tables as free tables when possible.
Using Yusdi Santoso's dbf.py and glob, here's some code to open each table in a directory and run through each record.
import glob
import os
import dbf

os.chdir("P:\\data")
for file in glob.glob("*.dbf"):
    table = dbf.readDbf(file)
    for row in table:
        # do stuff
        pass