Data manipulation by users - python

I've got a Django Web app that (potentially) processes millions of records per user. In a nutshell, users upload files which are mapped to database fields/tables and the data is ultimately loaded into one of 5 MySQL tables. I'm using the awesome DataTables library to display the data back to the user.
In many instances data will be loaded into the app and later discovered to be incorrect in the source file. For example, 500K records might be loaded, but in 400 records the first and last names were transposed. Users are able to manipulate/modify a record at a time, but it's not reasonable to expect them to change 400 records manually.
Clearly this can be solved by allowing them to delete the 400 offending records and reloading a fixed source file, but this gets to the root of my problem. How do you allow users to selectively delete (or modify) 400 records from a list of 500K based on some condition? I'm essentially asking, "how do I allow my users to execute restricted, but arbitrary, SQL against my database without having something horrible go wrong?"
I know I could construct some kind of "SQL builder" in the Web app, but that approach seems... wrong somehow. Is there something like a seriously restricted phpMyAdmin or SQL Buddy I could expose to my users? I've searched for Django apps along those lines, but I've come up with nothing. I suppose I could offer them some keyword search/filter and then allow them to delete anything that meets the criteria.
Anyone out there tackled this problem and have some guidance? I'm genuinely stumped as to the best approach.

Who are your users, and how much do you trust them?
If your trust is low, you will need to expose some sort of API.
If your trust is high, take suggestions for an admin command such as "transpose names" and give them access to that table in the Django admin. Then they can select batches of records to apply this command against.
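For example, a minimal sketch of such a command as a Django admin action (the Record model and its first_name/last_name fields are assumptions, not names from the original app):
# admin.py - a sketch only; "Record", "first_name" and "last_name" are assumed names
from django.contrib import admin
from .models import Record

def transpose_names(modeladmin, request, queryset):
    # swap first and last name on every selected record
    for record in queryset:
        record.first_name, record.last_name = record.last_name, record.first_name
        record.save(update_fields=["first_name", "last_name"])
transpose_names.short_description = "Swap first and last names"

class RecordAdmin(admin.ModelAdmin):
    list_display = ("first_name", "last_name")
    list_filter = ("last_name",)      # lets users narrow down to the offending batch
    actions = [transpose_names]

admin.site.register(Record, RecordAdmin)
Users can then filter the admin changelist down to the affected rows, select them, and run the action instead of editing hundreds of records by hand.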

Related

Suggestions on how my end users can extract data using my SQL query without having to use SQL?

The problem is that I've ended up with the extra role of having to run these queries for different users every day, because no one in our factory knows how to use SQL and my IT wouldn't grant them database access anyway.
I've consulted some tech-savvier friends who have suggested that I create a GUI (using Python) that does the querying for me. That sounds like it would be ideal, since all my end users would have to do is key something into an input field on the interface and the query would run, versus them querying for the data directly and potentially messing up my SQL script.
I've also read online about the possibility of doing something like this in Excel instead. Right now I'm just really lost as to where to start. I'd probably need to do a lot of reading up anyway, but it would at least be nice to know a good direction to start in. Any new suggestions would be appreciated.
This could be overkill for you, but Metabase is basically a GUI you can pretty easily set up on top of your SQL database.
Its target audience is "business analytics", so it enables people to create "questions" (i.e. queries) using a GUI rather than writing SQL (although it allows raw SQL too). More relevant to you, you probably just want to preconfigure some "questions" your users can log in and run in Metabase.
Without writing any code at all you could set up Metabase, write the query you want using either raw SQL or their GUI, and parameterize the query with some user_id or something the users can fill in. From there, you could give the users access to your Metabase (via the web), where they can go in and use the GUI to fill in whatever they need to and then run the query you've created for them.
Metabase also has a lot of customization, so you can make sure they only have permissions to see certain data, etc., and they don't need DB credentials. You, the Metabase admin, connect the database when you set up Metabase and give other users Metabase logins instead.

Is it possible to combine two pre-existing databases into one front-end?

I am new at working with databases, and I have been given the task to combine data from two large databases as part of an internship program (heavily focused on the learning experience, but not many people at the job are familiar with databases). The options are either to create a new table or database, or make a front-end that pulls data from both databases. Is it possible to just make a front-end for this? There is an issue of storage if a new database has to be created.
I'm still at the stage where I'm trying to figure out exactly how I'm going to go about doing this, and how to access the data in the first place. I have the table data for the two databases that already exist, and I know what items need to be pulled from both. The end goal is a website where the user can input one of the values and get back all the information about that item. One of the databases is an Oracle database in SQL and the other is a Cisco Prime database. I am planning to work in Python if possible. Any guidance on this would be very helpful!
Yes, it is perfectly OK to access both data sources from a single frontend.
Having said that, it might be a problem if you need to combine data from both data sources in large quantities, because you might have to reimplement some relational database functions such as join, sort, and group by.
Python is perfectly capable of connecting to Oracle data sources. I'm not so sure about Cisco Prime (which is an unusual database).
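As a rough illustration only (the cx_Oracle DSN, the Cisco Prime URL, and every table/field name below are placeholders, not a real schema), the frontend can query Oracle directly, pull the matching record from Cisco Prime over HTTP, and merge the two results in Python:
# a minimal sketch: Oracle via cx_Oracle, Cisco Prime via a hypothetical REST endpoint
import cx_Oracle
import requests

def lookup_item(item_id):
    # 1) Oracle side
    conn = cx_Oracle.connect("user/password@dbhost:1521/service")
    try:
        cur = conn.cursor()
        cur.execute("SELECT name, location FROM inventory WHERE item_id = :id", id=item_id)
        oracle_row = cur.fetchone()
    finally:
        conn.close()

    # 2) Cisco Prime side (placeholder URL and credentials)
    resp = requests.get(
        "https://prime.example.com/api/v1/items",
        params={"itemId": item_id},
        auth=("api_user", "api_password"),
    )
    prime_data = resp.json()

    # 3) "join" the two sources in application code and return one merged result
    return {"oracle": oracle_row, "prime": prime_data}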
I would recommend using Linux or Mac (not Windows) if you are new to Python, since both platforms are more Python-friendly than Windows.

How to instruct SQLAlchemy ORM to execute multiple queries in parallel when loading relationships?

I am using SQLAlchemy's ORM. I have a model that has multiple many-to-many relationships:
User
User <--MxN--> Organization
User <--MxN--> School
User <--MxN--> Credentials
I am implementing these using association tables, so there are also User_to_Organization, User_to_School and User_to_Credentials tables that I don't directly use.
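For reference, a stripped-down sketch of that mapping (table and column names here are assumptions, and only one of the three associations is shown):
# a minimal sketch of the association-table setup described above
from sqlalchemy import Column, ForeignKey, Integer, Table
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import relationship

Base = declarative_base()

user_to_organization = Table(
    "user_to_organization", Base.metadata,
    Column("user_id", Integer, ForeignKey("users.id"), primary_key=True),
    Column("organization_id", Integer, ForeignKey("organizations.id"), primary_key=True),
)

class Organization(Base):
    __tablename__ = "organizations"
    id = Column(Integer, primary_key=True)

class User(Base):
    __tablename__ = "users"
    id = Column(Integer, primary_key=True)
    organizations = relationship("Organization", secondary=user_to_organization)
    # schools and credentials follow the same pattern via their own association tables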
Now, when I attempt to load a single User (using its PK identifier) and its relationships (and related models) using joined eager loading, I get horrible performance (15+ seconds). I assume this is due to this issue:
When multiple levels of depth are used with joined or subquery loading, loading collections-within-collections will multiply the total number of rows fetched in a cartesian fashion. Both forms of eager loading always join from the original parent class.
If I introduce another level or two to the hierarchy:
Organization <--1xN--> Project
School <--1xN--> Course
Project <--MxN--> Credentials
Course <--MxN--> Credentials
The query takes 50+ seconds to complete, even though the total number of records in each table is fairly small.
Using lazy loading, I am required to manually load each relationship, and there are multiple round trips to the server.
e.g.
Operations, executed serially as queries:
Get user
Get user's Organizations
Get user's Schools
Get user's credentials
For each Organization, get its Projects
For each School, get its Courses
For each Project, get its Credentials
For each Course, get its Credentials
Still, it all finishes in less than 200ms.
I was wondering if there is any way to keep using lazy loading but perform the relationship-loading queries in parallel, for example using the concurrent.futures module, asyncio, or gevent (a rough sketch of the idea follows the steps below).
e.g.
Step 1 (in parallel):
Get user
Get user's Organizations
Get user's Schools
Get user's credentials
Step 2 (in parallel):
For each Organization, get its Projects
For each School, get its Courses
Step 3 (in parallel):
For each Project, get its Credentials
For each Course, get its Credentials
Actually, at this point, a subquery-type load could also work, that is, returning Organization and OrganizationID/Project/Credentials in two separate queries:
e.g.
Step 1 (in parallel):
Get user
Get user's Organizations
Get user's Schools
Get user's credentials
Step 2 (in parallel):
Get Organizations
Get Schools
Get the Organizations' Projects, join with Credentials
Get the Schools' Courses, join with Credentials
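To make the first scheme concrete, here is a rough sketch of the parallel lazy loading using concurrent.futures, with one session per worker thread (the engine, model, and relationship names are assumed from the description above):
# a sketch only: each worker uses its own Session, since a Session is not thread-safe
from concurrent.futures import ThreadPoolExecutor
from sqlalchemy.orm import sessionmaker

Session = sessionmaker(bind=engine)   # assumes an `engine` configured elsewhere

def load_relationship(user_id, attr):
    session = Session()
    try:
        user = session.query(User).get(user_id)
        return list(getattr(user, attr))   # triggers the lazy load inside this session
    finally:
        session.close()

user_id = 1   # example primary key
with ThreadPoolExecutor(max_workers=3) as pool:
    futures = {name: pool.submit(load_relationship, user_id, name)
               for name in ("organizations", "schools", "credentials")}
    results = {name: future.result() for name, future in futures.items()}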
The first thing you're going to want to do is check to see what queries are actually being executed on the DB. I wouldn't assume that SQLAlchemy is doing what you expect unless you're very familiar with it. You can use echo=True on your engine configuration or look at some DB logs (not sure how to do that with MySQL).
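For example (the connection URL is just a placeholder):
from sqlalchemy import create_engine

# echo=True makes the engine log every SQL statement it emits
engine = create_engine("mysql+pymysql://user:password@localhost/mydb", echo=True)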
You've mentioned that you're using different loading strategies, so I guess you've read through the docs on that (http://docs.sqlalchemy.org/en/latest/orm/loading_relationships.html). For what you're doing, I'd probably recommend subquery loading, but it totally depends on the number of rows/columns you're dealing with. In my experience it's a good general starting point, though.
One thing to note: you might need to do something like:
from sqlalchemy.orm import subqueryload
db.query(Thing).options(subqueryload('A').subqueryload('B')).filter(Thing.id == x).first()
with filter(...).first() rather than get(), as the latter won't re-execute queries according to your loading strategy if the primary object is already in the identity map.
Finally, I don't know your data - but those numbers sound pretty abysmal for anything short of a huge data set. Check that you have the correct indexes specified on all your tables.
You may have already been through all of this, but based on the information you've provided, it sounds like you need to do more work to narrow down your issue. Is it the db schema, or is it the queries SQLA is executing?
Either way, I'd say, "no" to running multiple queries on different connections. Any attempt to do that could result in inconsistent data coming back to your app, and if you think you've got issues now..... :-)
MySQL has no parallelism within a single connection. For the ORM to run queries in parallel, it would need multiple connections to MySQL, and generally the overhead of doing that is not worth it.
Getting a user, his Organizations, Schools, etc., can all be done (in MySQL) via a single query:
SELECT user, organization, ...
FROM Users
JOIN Organizations ON ...
etc.
This is significantly more efficient than
SELECT user FROM ...;
SELECT organization ... WHERE user = ...;
etc.
(This is not "parallelism".)
Or maybe your "steps" are not quite 'right'?...
SELECT user, organization, project
FROM Users
JOIN Organizations ...
JOIN Projects ...
That gets, in a single step, all users, together with all their organizations and projects.
But is a "user" associated with a "project"? If not, then this is the wrong approach.
If the ORM is not providing a mechanism to generate queries like those, then it is "getting in the way".

Importing a CSV file into a PostgreSQL DB using Python-Django

Note: Scroll down to the Background section for useful details. Assume the project uses Python/Django and South in the following illustration.
What's the best way to import the following CSV
"john","doe","savings","personal"
"john","doe","savings","business"
"john","doe","checking","personal"
"john","doe","checking","business"
"jemma","donut","checking","personal"
Into a PostgreSQL database with the related tables Person, Account, and AccountType considering:
Admin users can change the database model and CSV import-representation in real-time via a custom UI
The saved CSV-to-Database table/field mappings are used when regular users import CSV files
So far two approaches have been considered
ETL-API Approach: Providing an ETL API with a spreadsheet, my CSV-to-Database table/field mappings, and connection info to the target database. The API would then load the spreadsheet and populate the target database tables. Looking at pygrametl, I don't think what I'm aiming for is possible. In fact, I'm not sure any ETL APIs do this.
Row-level Insert Approach: Parsing the CSV-to-Database table/field mappings, parsing the spreadsheet, and generating SQL inserts in "join-order".
I implemented the second approach but am struggling with algorithm defects and code complexity. Is there a Python ETL API out there that does what I want, or an approach that doesn't involve reinventing the wheel?
Background
The company I work at is looking to move hundreds of project-specific design spreadsheets hosted in sharepoint into databases. We're near completing a web application that meets the need by allowing an administrator to define/model a database for each project, store spreadsheets in it, and define the browse experience. At this stage of completion transitioning to a commercial tool isn't an option. Think of the web application as a django-admin alternative, though it isn't, with a DB modeling UI, CSV import/export functionality, customizable browse, and modularized code to address project-specific customizations.
The implemented CSV import interface is cumbersome and buggy, so I'm trying to get feedback and find alternate approaches.
How about separating the problem into two separate problems?
Create a Person class which represents a person in the database. This could use Django's ORM, or extend it, or you could do it yourself.
Now you have two issues:
Create a Person instance from a row in the CSV.
Save a Person instance to the database.
Now, instead of just CSV-to-Database, you have CSV-to-Person and Person-to-Database. I think this is conceptually cleaner. When the admins change the schema, that changes the Person-to-Database side. When the admins change the CSV format, they're changing the CSV-to-Person side. Now you can deal with each separately.
Does that help any?
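A minimal sketch of that split (the Person fields and the mapping format are assumptions):
import csv

class Person:
    def __init__(self, first_name, last_name, account_type, account_category):
        self.first_name = first_name
        self.last_name = last_name
        self.account_type = account_type
        self.account_category = account_category

def csv_to_people(path, mapping):
    # CSV-to-Person side; `mapping` is the admin-editable column-to-attribute map,
    # e.g. {0: "first_name", 1: "last_name", 2: "account_type", 3: "account_category"}
    people = []
    with open(path, newline="") as f:
        for row in csv.reader(f):
            people.append(Person(**{attr: row[col] for col, attr in mapping.items()}))
    return people

def save_person(person):
    # Person-to-Database side: translate a Person into the current Person/Account/
    # AccountType tables (e.g. via Django's ORM); schema changes only touch this part
    ...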
I write import sub-systems almost every month at work, and since I do that kind of task so often, some time ago I wrote django-data-importer. This importer works like a Django form and has readers for CSV, XLS and XLSX files that give you lists of dicts.
With data_importer readers you can read a file into a list of dicts, iterate over it with a for loop, and save the lines to the DB.
With the importer you can do the same, but with the bonus of validating each field of every line, logging errors and actions, and saving everything at the end.
Please take a look at https://github.com/chronossc/django-data-importer. I'm pretty sure it will solve your problem and help you process any kind of CSV file from now on :)
To solve your problem, I suggest using data-importer with Celery tasks. You upload the file and fire an import task via a simple interface. The Celery task will send the file to the importer, and you can validate lines, save them, and log errors for them. With some effort you can even present the progress of the task to the users who uploaded the sheet.
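A rough sketch of that flow (the task body uses the standard csv module as a stand-in for the importer, and save_row is a hypothetical helper, not part of any library):
# tasks.py - a sketch of the upload-then-import flow
import csv
from celery import shared_task

@shared_task
def import_spreadsheet(path, mapping_id):
    errors = []
    with open(path, newline="") as f:
        for lineno, row in enumerate(csv.reader(f), start=1):
            try:
                save_row(row, mapping_id)   # hypothetical: validate and insert one row
            except ValueError as exc:
                errors.append((lineno, str(exc)))
    return errors

# in the upload view, after the file has been written to disk:
#     import_spreadsheet.delay(saved_path, mapping.pk)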
I ended up taking a few steps back to address this problem per Occam's razor using updatable SQL views. It meant a few sacrifices:
Removing: South.DB-dependent real-time schema administration API, dynamic model loading, and dynamic ORM syncing
Defining models.py and an initial south migration by hand.
This allows for a simple approach to importing flat datasets (CSV/Excel) into a normalized database (a condensed sketch follows the list):
Define unmanaged models in models.py for each spreadsheet
Map those to updatable SQL Views (INSERT/UPDATE-INSTEAD SQL RULEs) in the initial south migration that adhere to the spreadsheet field layout
Iterate through the CSV/Excel spreadsheet rows and perform an INSERT INTO <VIEW> (<COLUMNS>) VALUES (<CSV-ROW-FIELDS>); for each row
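A condensed sketch of steps 1 and 3 (the model, view, and column names are assumptions):
# models.py - unmanaged model mirroring the spreadsheet layout (step 1)
from django.db import models

class AccountSheetRow(models.Model):
    first_name = models.TextField()
    last_name = models.TextField()
    account_type = models.TextField()
    account_category = models.TextField()

    class Meta:
        managed = False                 # backed by the updatable SQL view from step 2
        db_table = "account_sheet_v"

# step 3 - one INSERT against the view per spreadsheet row
import csv
from django.db import connection

def import_csv(path):
    cursor = connection.cursor()
    with open(path, newline="") as f:
        for row in csv.reader(f):
            cursor.execute(
                "INSERT INTO account_sheet_v "
                "(first_name, last_name, account_type, account_category) "
                "VALUES (%s, %s, %s, %s)", row)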
Here is another approach that I found on GitHub. Basically it detects the schema and allows overrides. Its whole goal is simply to generate raw SQL to be executed by psql or whatever driver.
https://github.com/nmccready/csv2psql
% python setup.py install
% csv2psql --schema=public --key=student_id,class_id example/enrolled.csv > enrolled.sql
% psql -f enrolled.sql
There are also a bunch of options for doing ALTERs (creating primary keys from several existing columns) and merges/dumps.

Reverse Search Best Practices?

I'm making an app that has a need for reverse searches. By this, I mean that users of the app will enter search parameters and save them; then, when any new objects get entered onto the system, if they match the existing search parameters that a user has saved, a notification will be sent, etc.
I am having a hard time finding solutions for this type of problem.
I am using Django and thinking of building the searches and pickling them using Q objects as outlined here: http://www.djangozen.com/blog/the-power-of-q
The way I see it, when a new object is entered into the database, I will have to load every single saved query from the db and somehow run it against this one new object to see if it would match that search query... This doesn't seem ideal - has anyone tackled such a problem before?
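For what it's worth, the pickled-Q idea in miniature (SavedSearch, Item, and notify are assumed names, user and new_item stand for an existing user and a freshly saved object, and SavedSearch.query is assumed to be a binary field):
import pickle
from django.db.models import Q

# saving a search: serialize the user's criteria into a SavedSearch row
criteria = Q(category="books") & Q(price__lt=20)
SavedSearch.objects.create(user=user, query=pickle.dumps(criteria))

# when a new object is entered: replay every saved query against just that one object
for search in SavedSearch.objects.all():
    criteria = pickle.loads(search.query)
    if Item.objects.filter(criteria, pk=new_item.pk).exists():
        notify(search.user, new_item)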
At the database level, many databases offer 'triggers'.
Another approach is to have timed jobs that periodically fetch all items from the database with a last-modified date later than the last run; these then get filtered and alerts are issued. You can perhaps push some of the filtering into the query statement in the database. However, this is a bit trickier if notifications need to be sent when items get deleted.
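A small sketch of such a periodic job (Item, SavedSearch, and notify are assumed names, with saved queries stored as pickled Q objects as in the question):
import pickle
from django.utils import timezone

def run_saved_searches(last_run):
    # only consider items touched since the previous run
    recent = Item.objects.filter(modified__gte=last_run)
    for search in SavedSearch.objects.all():
        for item in recent.filter(pickle.loads(search.query)):
            notify(search.user, item)
    return timezone.now()   # persist this value and pass it back in as last_run next time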
You can also put triggers manually into the code that submits data to the database, which is perhaps more flexible and certainly doesn't rely on specific features of the database.
A nice way for the triggers and the alerts to communicate is through message queues - queues such as RabbitMQ and other AMQP implementations will scale with your site.
The amount of effort you use to solve this problem is directly related to the number of stored queries you are dealing with.
Over 20 years ago we handled stored queries by treating them as minidocs and indexing them based on all of the must-have and may-have terms. A new doc's term list was used as a sort of query against this "database of queries", and that built a list of possibly interesting searches to run; then only those searches were run against the new docs. This may sound convoluted, but when there are more than a few stored queries (say anywhere from 10,000 to 1,000,000 or more) and you have a complex query language that supports a hybrid of Boolean and similarity-based searching, it substantially reduced the number we had to execute as full-on queries -- often no more than 10 or 15 queries.
One thing that helped was that we were in control of the horizontal and the vertical of the whole thing. We used our query parser to build a parse tree and that was used to build the list of must/may have terms we indexed the query under. We warned the customer away from using certain types of wildcards in the stored queries because it could cause an explosion in the number of queries selected.
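The core of that trick, in modern terms, is just an inverted index over the stored queries' terms; a tiny sketch (not the original system):
from collections import defaultdict

# term -> ids of stored queries that mention that term (must-have or may-have)
query_index = defaultdict(set)

def index_query(query_id, must_terms, may_terms):
    for term in set(must_terms) | set(may_terms):
        query_index[term].add(query_id)

def candidate_queries(new_doc_terms):
    # the new document's term list acts as a query against the "database of queries";
    # only the returned candidates are then executed as full searches against the doc
    candidates = set()
    for term in set(new_doc_terms):
        candidates |= query_index.get(term, set())
    return candidates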
Update for comment:
Short answer: I don't know for sure.
Longer answer: We were dealing with a custom-built text search engine, and part of its query syntax allowed slicing the doc collection in certain ways very efficiently, with special emphasis on date_added. We played a lot of games because we were ingesting 4-10,000,000 new docs a day and running them against up to 1,000,000+ stored queries on DEC Alphas with 64MB of main memory. (This was in the late 80's/early 90's.)
I'm guessing that filtering on something equivalent to date_added could be used in combination with the date of the last time you ran your queries, or maybe the highest id at the last query run time. If you need to re-run the queries against a modified record, you could use its id as part of the query.
For me to get any more specific, you're going to have to get a lot more specific about exactly what problem you are trying to solve and the scale of the solution you are trying to accomplish.
If you stored the type(s) of object(s) involved in each stored search as a generic relation, you could add a post-save signal to all involved objects. When the signal fires, it looks up only the searches that involve its object type and runs those. That probably will still run into scaling issues if you have a ton of writes to the db and a lot of saved searches, but it would be a straightforward Django approach.
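A minimal sketch of that approach (a SavedSearch model with a content_type field, pickled Q objects, and a notify helper are all assumptions):
import pickle
from django.contrib.contenttypes.models import ContentType
from django.db.models.signals import post_save
from django.dispatch import receiver

@receiver(post_save)   # no sender given, so this fires for every model
def run_matching_searches(sender, instance, created, **kwargs):
    if not created:
        return
    content_type = ContentType.objects.get_for_model(sender)
    for search in SavedSearch.objects.filter(content_type=content_type):
        criteria = pickle.loads(search.query)
        if sender.objects.filter(criteria, pk=instance.pk).exists():
            notify(search.user, instance)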
