I am currently porting some code from Spark to the MSSQL Analytical Services with Python. Everything is nice and dandy, but I am not sure if my solution is the correct one for multiple inputs for the scripts.
Consider the following code snippet:
DROP PROCEDURE IF EXISTS SampleModel;
GO
CREATE PROCEDURE SampleModel
AS
BEGIN
exec sp_execute_external_script
@language = N'Python',
@script = N'
import sys
sys.path.append(r"C:\path\to\custom\package")
from super_package.sample_model import run_model
OutputDataSet = run_model()'
WITH RESULT SETS ((Score float));
END
GO
INSERT INTO [dbo].[SampleModelPredictions] (prediction) EXEC [dbo].[SampleModel]
GO
I have a custom package called super_package and a sample model called sample_model. Since this model uses multiple database tables as input, and I would rather have everything in one place, I have a module that connects to the database and fetches the data directly:
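from revoscalepy import RxSqlServerData, rx_data_step

def go_go_get_data(query, config):
    # Run the query on SQL Server and return the result as a data frame.
    return rx_data_step(RxSqlServerData(
        sql_query=query,
        connection_string=config.connection_string,
        user=config.user,
        password=config.password))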
Inside the run_model() function I fetch all necessary data from the database with the go_go_get_data function.
If the data is too big to handle in one go, I would do some pagination.
In general I cannot join the tables so this solution doesn't work.
The question is: Is this the right approach to tackle this problem? Or did I miss something? For now this works, but as I am still in the development / tryout phase I cannot be certain that this will scale. I would rather use the parameters for the stored procedure than fetching inside the Python context.
As you've already figured out, sp_execute_external_script only allows one result set to be passed in. :-(
You can certainly query from inside the script to fetch data as long as your script is okay with the fact that it's not executing under the current SQL session's user's permissions.
If pagination is important and one data set is significantly larger than the others and you're using Enterprise Edition, you might consider passing the largest data set into the script in chunks using sp_execute_external_script's streaming feature.
If you'd like all of your data to be assembled in SQL Server (vs. fetched by queries in your script), you could try to serialize the result sets and then pass them in as parameters (link describes how to do this in R but something similar should be possible with Python).
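On the Python side, the serialization idea could look roughly like this. It is a minimal sketch and not the approach from the linked R article: it assumes each result set is pickled into a bytes value (which SQL Server would carry as varbinary(max)) and unpickled again inside the model script. The names pack_result_set and unpack_result_set, and the sample rows, are hypothetical.

```python
import pickle

def pack_result_set(rows):
    """Serialize one result set (here a list of row tuples) into bytes."""
    return pickle.dumps(rows, protocol=pickle.HIGHEST_PROTOCOL)

def unpack_result_set(blob):
    """Rebuild a result set from bytes received as a script parameter."""
    # Only unpickle data that comes from a source you trust.
    return pickle.loads(blob)

# Two independent inputs that cannot be joined into one data set.
customers = [(1, "Alice"), (2, "Bob")]
orders = [(10, 1, 99.5), (11, 2, 12.0)]

customers_blob = pack_result_set(customers)
orders_blob = pack_result_set(orders)

restored_customers = unpack_result_set(customers_blob)
restored_orders = unpack_result_set(orders_blob)
```

The same two pickle calls also work on a pandas DataFrame, which is the type the script receives as its input data set.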
I'm migrating data from SQL Server 2017 to Postgres 10.5, i.e., all the tables, stored procedures etc.
I want to compare the data consistency between SQL Server and Postgres databases after the data migration.
All I can think of now is using Python Pandas and loading the tables into data frames from SQL Server and also Postgres and compare the data frames.
But the data is around 6 GB which takes much time for loading table into the data frame and also hosted on a server which is not local to where I'm running the Python script. Is there any way to efficiently compare the data consistency across SQL Server and Postgres?
Yes, you can order the data by primary key, and then write the data to a json or xml file.
Then you can run diff over the two files.
You can also run this chunked by primary-key, that way you don't have to work with a huge file.
Log any diff that doesn't show as equal.
If it doesn't matter what the difference is, you could also just run MD5/SHA1 on the two file chunks: if the hashes match, there is no difference; if they don't, there is.
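The chunked hash comparison could be sketched in Python as follows. This assumes integer primary keys stored as the first element of each row, and that both sides can be rendered to the same text form; the function names and the span parameter are made up for the illustration. In practice you would stream each key range from the two servers instead of holding whole tables in lists.

```python
import hashlib
import json

def chunk_digests(rows, span):
    """Map each primary-key range (pk // span) to a SHA-1 digest of its rows."""
    hashers = {}
    for row in sorted(rows, key=lambda r: r[0]):
        bucket = row[0] // span
        hasher = hashers.setdefault(bucket, hashlib.sha1())
        # default=str gives dates, decimals and similar types a text form.
        hasher.update(json.dumps(row, default=str).encode("utf-8"))
        hasher.update(b"\n")
    return {bucket: hasher.hexdigest() for bucket, hasher in hashers.items()}

def differing_buckets(rows_a, rows_b, span=1000):
    """Return the key ranges whose digests differ or exist on one side only."""
    digests_a = chunk_digests(rows_a, span)
    digests_b = chunk_digests(rows_b, span)
    all_buckets = digests_a.keys() | digests_b.keys()
    return sorted(b for b in all_buckets if digests_a.get(b) != digests_b.get(b))

sql_server_rows = [(1, "a"), (2, "b"), (1001, "c")]
postgres_rows = [(1, "a"), (2, "B"), (1001, "c")]
changed = differing_buckets(sql_server_rows, postgres_rows, span=1000)
```

Only the ranges reported as different need a row-by-row diff afterwards.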
Speaking from experience with nhibernate, what you need to watch out for is:
bit fields
text, ntext, varchar(MAX), nvarchar(MAX) fields (they map to varchar with no length, by the way - encoding UTF8)
varbinary, varbinary(MAX), image (bytea[] vs. LOB)
xml
that every primary key's serial (sequence) generator is reset after you have inserted all the data into PostgreSQL.
Another thing to watch out for is which time zone CURRENT_TIMESTAMP uses.
Note:
I'd actually run System.Data.DataRowComparer directly, without writing data to a file:
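using System.Collections.Generic;
using System.Data;
using System.Linq;

static void Main(string[] args)
{
    // LoadSqlServerTable/LoadPostgresTable are placeholders for your own loaders.
    DataTable dt1 = LoadSqlServerTable();
    DataTable dt2 = LoadPostgresTable();
    IEnumerable<DataRow> idr1 = dt1.Select();
    IEnumerable<DataRow> idr2 = dt2.Select();
    // DataRowComparer.Default compares rows by value; without a comparer,
    // Except() uses reference equality and reports every row as different.
    IEnumerable<DataRow> results = idr1.Except(idr2, DataRowComparer.Default);
}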
Then you write all non-matching DataRows into a logfile, with one directory per table (if there are differences).
Don't know what Python uses in place of System.Data.DataRowComparer, though.
Since this would be a one-time task, you could also opt to not do it in Python, and use C# instead (see above code sample).
Also, if you had large tables, you could use DataReader with sequential access to do the comparison. But if the other way cuts it, it reduces the required work considerably.
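For what it's worth, a rough Python counterpart of the value-based row comparison is a set lookup over row tuples. This is only a sketch: it assumes both drivers return rows as hashable tuples with comparable Python types, and rows_only_in_first is a made-up helper name.

```python
def rows_only_in_first(rows_a, rows_b):
    """Return the rows of rows_a that have no equal row in rows_b."""
    seen = set(rows_b)
    return [row for row in rows_a if row not in seen]

sql_server_rows = [(1, "x", True), (2, "y", False)]
postgres_rows = [(1, "x", True), (2, "y", True)]

# Rows that exist in SQL Server but have no identical row in Postgres.
mismatches = rows_only_in_first(sql_server_rows, postgres_rows)
```

Running it in both directions also catches rows that exist only in Postgres.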
Have you considered making your SQL Server data visible within your Postgres with a Foreign Data Wrapper (FDW)?
https://github.com/tds-fdw/tds_fdw
I haven't used this FDW tool but, overall, the basic FDW setup process is simple. An FDW acts like a proxy/alias, allowing you to access remote data as though it were housed in Postgres. The tool linked above doesn't support joins, so you would have to perform your comparisons iteratively, etc. Depending on your setup, you would have to check if performance is adequate.
Please report back!
Background:
I am developing a Django app for a business application that takes client data and displays charts in a dashboard. I have large databases full of raw information such as part sales by customer, and I will use that to populate the analyses. I have been able to do this very nicely in the past using python with pandas, xlsxwriter, etc., and am now in the process of replicating what I have done in the past in this web app. I am using a PostgreSQL database to store the data, and then using Django to build the app and fusioncharts for the visualization. In order to get the information into Postgres, I am using a python script with sqlalchemy, which does a great job.
The question:
There are two ways I can manipulate the data that will be populating the charts. 1) I can use the same script that exports the data to postgres to arrange the data as I like it before it is exported. For instance, in certain cases I need to group the data by some parameter (by customer for instance), then perform calculations on the groups by columns. I could do this for each different slice I want and then export different tables for each model class to postgres.
2) I can upload the entire database to postgres and manipulate it later with django commands that produce SQL queries.
I am much more comfortable doing it up front with Python because I have been doing it that way for a while. I also understand that Django's queries are a little more difficult to implement. However, doing it with Python would mean that I will need more tables (because I will have grouped them in different ways), and I don't want to do it the way I know just because it is easier, if uploading a single database and using Django/SQL queries would be more efficient in the long run.
Any thoughts or suggestions are appreciated.
Well, it's the usual tradeoff between performance and flexibility. With the first approach you get better performance (your schema is tailored to the exact queries you want to run) but less flexibility: if you need to add more queries, the schema might not match so well, or not at all, in which case you'll have to repopulate the database, possibly from raw sources, with an updated schema. With the second one you (hopefully) have a well-normalized schema, but one that makes queries much more complex and much heavier on the database server.
Now the question is: do you really have to choose ? You could also have both the fully normalized data AND the denormalized (pre-processed) data alongside.
As a side note: Django's ORM is indeed mostly an "80/20" tool. It's designed to make the 80% of simple queries super easy (much easier than, say, SQLAlchemy), and beyond that it becomes a bit of a PITA. But nothing forces you to use Django's ORM for everything: you can always drop down to raw SQL or use SQLAlchemy alongside.
Oh, and yes: your problem is nothing new. You may want to read about OLAP.
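Keeping both the raw and the pre-processed data side by side could look something like this. It is a sketch only: sqlite3 stands in for PostgreSQL so that it runs anywhere, and the table and column names (part_sales, sales_by_customer) are invented. In PostgreSQL itself, a materialized view would be another way to get the same effect.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE part_sales (customer TEXT, part TEXT, amount REAL);
    CREATE TABLE sales_by_customer (customer TEXT PRIMARY KEY, total REAL);
""")
conn.executemany(
    "INSERT INTO part_sales VALUES (?, ?, ?)",
    [("acme", "bolt", 10.0), ("acme", "nut", 5.0), ("zenith", "bolt", 7.5)],
)

def refresh_summary(connection):
    """Rebuild the pre-aggregated table from the raw rows."""
    with connection:  # commits on success, rolls back on error
        connection.execute("DELETE FROM sales_by_customer")
        connection.execute("""
            INSERT INTO sales_by_customer (customer, total)
            SELECT customer, SUM(amount) FROM part_sales GROUP BY customer
        """)

refresh_summary(conn)
summary = conn.execute(
    "SELECT customer, total FROM sales_by_customer ORDER BY customer"
).fetchall()
```

The dashboard reads the small summary table, while the raw table stays available for any new slice you need later.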
I connect to sqlite database in Blaze using
df = bz.Data("sqlite:///<mydatabase>")
everything works fine but I do not know how to provide user-defined functions in my interaction with df.
I have a column called IP in df which is text containing IP addresses. I also have a function toSubnet(x, y) which takes an IP address (x) in text format and returns its /y subnet. For example:
out = toSubnet('1.1.1.1',24)
out
1.1.1.0/24
Now if I want to map all IPs to their /14 subnets, I use:
df.IP.map(lambda x:toSubnet(x,14),'string')
Which works when the backend is CSV. But with sqlite backend I get NotImplementedError.
What's wrong here?
NB: This doesn't tell you how to do exactly what you want, but it provides an explanation of why it doesn't work and a possible next step to get this to work with SQLite.
The problem you're running into is that it is very difficult to efficiently execute arbitrary Python code against an arbitrary SQL database.
Blaze takes user code and translates it to SQL as best it can, using SQLAlchemy, which I don't think has a way to do this.
Since nearly every database has a different way of dealing with user-defined functions (UDFs), it's quite a lot of work to build an API that allows the following:
A user to define a function in Python
Turn that pure Python function into a UDF native to the database.
That said, the Python interface to SQLite has a way to register Python functions that can be executed in a SQL statement:
https://docs.python.org/2/library/sqlite3.html#sqlite3.Connection.create_function
There currently isn't a way to express a UDF with Blaze using the SQL backend, though this could be implemented as a new expression type that allows a user to register a function via the underlying database's DB-API.
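Outside of Blaze, registering the function with the sqlite3 module directly might look like this. It is a sketch under assumptions: to_subnet is a guess at what the asker's toSubnet does, rebuilt here on top of the standard ipaddress module, and the hosts table is invented for the demo.

```python
import ipaddress
import sqlite3

def to_subnet(ip, prefix):
    """Return the /prefix network that contains ip, as text."""
    return str(ipaddress.ip_network(f"{ip}/{prefix}", strict=False))

conn = sqlite3.connect(":memory:")
# Expose the Python function to SQL under the name to_subnet, taking 2 arguments.
conn.create_function("to_subnet", 2, to_subnet)

conn.execute("CREATE TABLE hosts (IP TEXT)")
conn.executemany("INSERT INTO hosts VALUES (?)", [("1.1.1.1",), ("10.200.3.4",)])

subnets = [
    row[0]
    for row in conn.execute("SELECT to_subnet(IP, 14) FROM hosts ORDER BY rowid")
]
```

The function runs in the Python process for every row, so this is convenient rather than fast.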
This might sound like a bit of an odd question - but is it possible to load data from a (in this case MySQL) table to be used in Django without the need for a model to be present?
I realise this isn't really the Django way, but given my current scenario, I don't really know how better to solve the problem.
I'm working on a site which, for one aspect, makes use of a table of data that has been bought from a third party. The columns of interest are likely to remain stable; however, the structure of the table could change with subsequent updates to the data set. The table is also massive (in terms of columns), so I'm not keen on typing out each field in the model one by one. I'd also like to leave the table intact, so coming up with a model which represents the set of columns I am interested in is not really an ideal solution.
Ideally, I want to have this table in a database somewhere (possibly separate to the main site database) and access its contents directly using SQL.
You can always execute raw SQL directly against the database: see the docs.
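A sketch of what that can look like: the helper below turns any DB-API cursor into a list of dicts keyed by column name, so only the columns you select matter and no model is needed. In Django the cursor would come from django.db.connection.cursor(), or from django.db.connections["alias"].cursor() for a separate database. Here sqlite3 stands in so the example runs on its own, and the vendor_data table and its columns are invented.

```python
import sqlite3

def dictfetchall(cursor):
    """Return all rows from a DB-API cursor as dicts keyed by column name."""
    columns = [col[0] for col in cursor.description]
    return [dict(zip(columns, row)) for row in cursor.fetchall()]

# sqlite3 stands in for the real (MySQL) connection in this demo.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE vendor_data (id INTEGER, name TEXT, extra TEXT)")
conn.execute("INSERT INTO vendor_data VALUES (1, 'widget', 'ignored')")

cursor = conn.cursor()
# Select only the columns of interest; the rest of the table is left alone.
cursor.execute("SELECT id, name FROM vendor_data")
rows = dictfetchall(cursor)
```

Because the query names its columns explicitly, extra or reordered columns in a later delivery of the table do not break it.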
Django has a feature called inspectdb. For legacy databases such as your MySQL table, it generates the models automatically by inspecting the database tables; you save its output in your app as models.py, so you don't need to type every column manually. Read the documentation carefully before using the generated models, though, because it may affect the data in the DB. I hope this is useful for you.
I guess you can use any SQL library available for Python. For example : http://www.sqlalchemy.org/
You then just have to connect to your database, run your query and use the data as you wish. I don't think you can use Django's own query layer without its model system, but nothing prevents you from using another library for this in parallel.
I'd like to start by asking for your opinion on how I should tackle this task, instead of simply how to structure my code.
Here is what I'm trying to do: I have a lot of data loaded into a MySQL table for a large number of unique names + dates (i.e., where the date is a separate field). My goal is to be able to select a particular name (using raw_input, and perhaps in the future a drop-down menu) and see a monthly trend, with a moving average and perhaps other stats, for one of the fields (revenue, revenue per month, clicks, etc.). What is your advice: should I move this data to an Excel workbook via Python, or is there a way to display this information in Python (with charts that compare to Excel, of course)?
Thanks!
Analyzing such (name, date) data can be seen as issuing ad-hoc SQL queries to get time series information.
You 'sample' your information by a date/time frame (day/week/month/year, or in more detail by hour/minute), depending on how large your dataset is.
I often use queries like this where the date field is truncated to the sample rate; in MySQL the DATE_FORMAT function is handy for that (Postgres and Oracle use date_trunc and trunc respectively).
What you want to see in your data goes into your WHERE conditions.
SELECT DATE_FORMAT(date_field, '%Y-%m-%d') AS day,
       COUNT(*) AS nb_event
FROM yourtable
WHERE name = 'specific_value_to_analyze'
GROUP BY DATE_FORMAT(date_field, '%Y-%m-%d');
Execute this query and write the output to a CSV file. You could use the mysql command-line client directly for that, but I recommend writing a Python script that executes the query; you can use getopt options for output formatting (with or without column headers, a separator other than the default one, etc.). You can even build the query dynamically based on some options.
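The core of such a script might look like this sketch. It is written against the generic DB-API and demonstrated with sqlite3 so it runs without a MySQL server, which is why the query uses strftime (SQLite's counterpart of DATE_FORMAT) and ? placeholders (MySQL drivers typically use %s). The function name query_to_csv and its header/delimiter options are illustrative; command-line parsing is left out.

```python
import csv
import io
import sqlite3

def query_to_csv(connection, sql, params, out, header=True, delimiter=","):
    """Run sql with params and write the result set to the file-like object out."""
    cursor = connection.cursor()
    cursor.execute(sql, params)
    writer = csv.writer(out, delimiter=delimiter, lineterminator="\n")
    if header:
        writer.writerow(col[0] for col in cursor.description)
    writer.writerows(cursor)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE yourtable (name TEXT, date_field TEXT)")
conn.executemany(
    "INSERT INTO yourtable VALUES (?, ?)",
    [
        ("a", "2020-01-01 10:00:00"),
        ("a", "2020-01-01 12:00:00"),
        ("a", "2020-01-03 09:00:00"),
        ("b", "2020-01-02 08:00:00"),
    ],
)

sql = """
    SELECT strftime('%Y-%m-%d', date_field) AS day, COUNT(*) AS nb_event
    FROM yourtable
    WHERE name = ?
    GROUP BY day
    ORDER BY day
"""
buffer = io.StringIO()
query_to_csv(conn, sql, ("a",), buffer)
csv_text = buffer.getvalue()
```

Passing the name as a query parameter rather than pasting it into the SQL string keeps the script safe when the value comes from user input.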
To plot such information, look at time series tools. If you have missing data (dates that won't appear in the result of such an SQL query), take care in your choice of tool. Excel is not the right one for that, I think (or I haven't mastered it enough), but it could be a start.
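If the plotting tool does not handle gaps itself, the missing dates can be filled in before plotting, for example like this. The sketch assumes that a date with no rows means zero events, which may not be true for your data, and it adds a simple trailing moving average since the question asks for one. Both function names are made up.

```python
from datetime import date, timedelta

def fill_missing_days(counts):
    """Return (day, count) pairs for every day in the range, using 0 for gaps."""
    if not counts:
        return []
    first, last = min(counts), max(counts)
    days = (first + timedelta(days=i) for i in range((last - first).days + 1))
    return [(day, counts.get(day, 0)) for day in days]

def moving_average(values, window):
    """Trailing mean over at most the `window` most recent values."""
    result = []
    for i in range(len(values)):
        recent = values[max(0, i - window + 1): i + 1]
        result.append(sum(recent) / len(recent))
    return result

# Result of the daily query: 2 January is missing because it had no rows.
counts = {date(2020, 1, 1): 2, date(2020, 1, 3): 1}
series = fill_missing_days(counts)
averages = moving_average([count for _, count in series], window=2)
```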
Personally, I found dygraphs, a JavaScript library, really cool for time series plotting, and it can use a CSV file as its source. Be careful with such a configuration: due to cross-domain security constraints, the CSV file and the HTML page that displays the Dygraph object should be on the same server (or whatever the security constraints of your browser will accept).
I used to build such web apps using Django, as it's my favourite web framework, where I wrap the URL calls like this:
GET /timeserie/view/<category>/<value_to_plot>
GET /timeserie/csv/<category>/<value_to_plot>
The first URL calls a view that simply renders a template file with a variable that references the URL to get the CSV file for the Dygraph object:
<script type="text/javascript">
  g3 = new Dygraph(
    document.getElementById("graphdiv3"),
    "{{ csv_url }}",
    {
      rollPeriod: 15,
      showRoller: true
    }
  );
</script>
The second URL calls a view that generates the SQL query and outputs the result as text/csv to be rendered by Dygraph.
It's "home made": it can stay simple or be extended, runs easily on any desktop computer, and could be extended to output JSON for use by other JavaScript libraries/frameworks.
Otherwise, there are open-source tools for this kind of reporting (though their time series capabilities are often not enough for my needs), like Pentaho, JasperReport, SOFA. You define the query as a data source inside a report in such a tool and build a graph that outputs the time series.
I find that today's web techniques, with the right JavaScript library/framework, are really starting to challenge that old fashion of reporting with such classical BI tools, and they make things interactive :-)
Your problem can be broken down into two main pieces: analyzing the data, and presenting it. I assume that you already know how to do the data analysis part, and you're wondering how to present it.
This seems like a problem that's particularly well suited to a web app. Is there a reason why you would want to avoid that?
If you're very new to web programming and programming in general, then something like web2py could be an easy way to get started. There's a simple tutorial here.
For a desktop database-heavy app, have a look at dabo. It makes things like creating views on database tables really simple. wxpython, on which it's built, also has lots of simple graphing features.